Changeset 1365


Timestamp:
Aug 24, 2011, 1:55:27 PM (8 years ago)
Author:
ashriram
Message:

Fixed methodology

Location:
docs/HPCA2012
Files:
4 edited

  • docs/HPCA2012/04-methodology.tex

    r1362 r1365  
    1 \section{Methodology}
     1\section{Evaluation Framework}
    22\label{section:methodology}
    33
    4 In this section we describe our methodology for the measurements and
    5 investigation of XML parser energy consumption and performance.  In
    6 brief, for each of the four XML parsers under study we propose to measure
    7 and evaluate the energy consumption required to carry out XML
    8 well-formedness checking, under a variety of workloads, and as
    9 executed on three different Intel processors.
     4\paragraph{XML Parsers}\label{parsers}
    105
    11 To begin our study we propose to first investigate each of the XML
    12 parsers in terms of the Performance Monitoring Counter (PMC) hardware
    13 events listed in the PMC Hardware Events subsection. Based on the
    14 findings of previous work \cite{bellosa2001, bertran2010, bircher2007}
    15 we have chosen several key hardware performance events for which the
    16 authors indicate a strong correlation with overall performance and
    17 energy consumption of the application. In addition, we measure the
    18 runtime counts of SIMD instructions and bitwise operations using the
    19 Intel Pin binary instrumentation framework. Based on these data we
    20 gain further insight into XML parser execution characteristics and
    21 compare and constrast each of the Parabix parser versions against the
    22 performance of standard industry parsers.
    23 
    24 The foundational work by Bellosa in \cite{bellosa2001} as well as more
    25 recent work in \cite {bircher2007, bertran2010} demonstrate that
    26 hardware-usage patterns have a significant impact on the energy
    27 consumption characteristics of an application \cite{bellosa2001,
    28   bircher2007, bertran2010}. Further, the authors demonstrate a strong
    29 correlation between specific PMC events and energy usage. However, each
    30 author differs slightly in their opinion of the exact set of PMCs to use.
    31 
    32 The following subsections describe the XML parsers under study, XML
    33 workloads, the hardware architectures, PMC hardware events selected
    34 for measurement, and the energy measurement instrumentation set up. We analyze the
    35 performance of each of the XML parsers under study based on PMC hardware event counts and contrast their energy consumption
    36 measurements based on direct measurements.
      6We evaluate Parabix against two widely available software parsers:
      7the Xerces-C++ and Expat XML parsers. Parabix is our open-source
      8XML parser, which leverages Parallel Bit Stream technology and the
      9SIMD capabilities of modern commodity processors.  Xerces-C++
      10version 3.1.1 (SAX) \cite{xerces} is a validating open-source XML
      11parser written in C++, available as part of the Apache project.
      12Expat version 2.0.1 \cite{expat} is a non-validating XML parser
      13library written in C.
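For reference, the well-formedness check that each parser performs amounts to a single streaming pass over the input. A minimal sketch using the Expat C API (this is not our actual test harness; the buffer size is arbitrary):
\begin{verbatim}
#include <expat.h>
#include <cstdio>

// Well-formedness check only: no handlers are registered, so Expat
// simply verifies that the input parses as well-formed XML.
static bool well_formed(const char *path) {
  FILE *f = std::fopen(path, "rb");
  if (!f) return false;
  XML_Parser p = XML_ParserCreate(NULL);
  char buf[16384];
  bool ok = (p != NULL);
  size_t n;
  while (ok && (n = std::fread(buf, 1, sizeof buf, f)) > 0)
    ok = (XML_Parse(p, buf, (int)n, 0) != XML_STATUS_ERROR);
  if (ok)
    ok = (XML_Parse(p, buf, 0, 1) != XML_STATUS_ERROR); // final call: end of input
  if (p) XML_ParserFree(p);
  std::fclose(f);
  return ok;
}
\end{verbatim}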
    3714
    3815
    39 \subsection{Parsers}\label{parsers}
      16\paragraph{XML Workloads}\label{workloads}
      17XML is used for a variety of purposes, ranging from databases to
      18configuration files on mobile phones. A key property of an XML file
      19that affects overall parsing performance is its \textit{markup
      20  density}, defined as the ratio of the total markup contained
      21within an XML file to the total XML document size.  This metric has
      22a substantial influence on the performance of traditional recursive
      23descent XML parser implementations.  We use a mixture of
      24document-oriented and data-oriented XML files in our study to
      25provide workloads spanning a full spectrum of markup densities.
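Stated as a formula (this simply restates the definition above), for an XML document $D$,
\[
  \textit{markup density}(D) \;=\; \frac{\text{total markup bytes in } D}{\text{total bytes in } D},
\]
so data-oriented files with many short tags sit toward the high end of the spectrum, while text-heavy document-oriented files sit toward the low end.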
    4026
    41 The XML parsing technologies selected for this study are the Parabix1,
    42 Parabix2, Xerces-C++, and Expat XML parsers. Parabix1 (parallel bit
    43 Streams for XML) is our first generation SIMD and Parallel Bit Stream
    44 technology based XML parser \cite{Parabix1}.  Parabix1 leverages the
    45 processor built-in {\em bitscan} operation for high-performance XML
    46 character scanning as well as the SIMD capabilities of modern
    47 commodity processors to achieve high performance.  Parabix2
    48 \cite{parabix2} represents the second generation of the Parabix1
    49 parser. Parabix2 is an open-source XML parser that also leverages
    50 Parallel Bit Stream technology and the SIMD capabilities of modern
    51 commodity processors. However, Parabix2 differs from Parabix1 in that
    52 it employs new parallelization techniques, such as a multiple cursor
    53 approach to parallel parsing together with bit stream addition
    54 techniques to advance multiple cursors independently and in
    55 parallel. Parabix2 delivers dramatic performance improvements over
    56 traditional byte-at-a-time parsing technology.  Xerces-C++ version
    57 3.1.1 (SAX) \cite{xerces} is a validating open source XML parser
    58 written in C++ by the Apache project.  Expat version 2.0.1
    59 \cite{expat} is a non-validating XML parser library written in C.
      27Table \ref{XMLDocChars} shows the document characteristics of the XML
      28input files selected for this performance study.  The jawiki.xml and
      29dewiki.xml files represent document-oriented XML inputs and contain
      30the three-byte and four-byte UTF-8 sequences required to encode
      31Japanese and German characters respectively.  The remaining data
      32files are data-oriented XML documents and consist entirely of
      33single-byte $7$-bit ASCII characters.
    6034
    6135\begin{table*}
     
    7751\end{table*}
    7852
    79 \subsection{Workloads}\label{workloads}
    8053
    81 Markup density is defined as the ratio of the total markup contained
    82 within an XML file to the total XML document size.  This metric has
    83 substantial influence on the performance of traditional recursive
    84 descent XML parser implementations.  We use a mixture of
    85 document-oriented and data-oriented XML files in our study to provide
    86 workloads with a full spectrum of markup densities.
      54\paragraph{Platform Hardware}
      55SSE extensions have been available on commodity Intel processors for
      56over a decade, since the Pentium III. They have steadily evolved with
      57improvements in instruction latency, cache interfaces, and register
      58resources, and with the addition of domain-specific instructions. Here
      59we investigate SIMD extensions across three different generations of
      60Intel processors. Table \ref{hwinfo} describes the Intel multicores we
      61study. We compare the energy and performance profiles of Parabix
      62across these platforms.  We also analyze the implementation specifics
      63of the SIMD extensions on each microarchitecture, and we evaluate
      64both the legacy SSE and the newer AVX extensions supported by
      65\SB{}.
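As an illustration only (this is not how our build system selects code paths; the \texttt{parse\_with\_*} entry points are hypothetical and GCC's \texttt{\_\_builtin\_cpu\_supports} builtin is assumed), the two instruction-set paths could be selected at run time as follows:
\begin{verbatim}
#include <cstdio>

// Hypothetical stand-ins for the SSE and AVX builds of the parser runtime.
static void parse_with_sse(const char *path) { std::printf("SSE path: %s\n", path); }
static void parse_with_avx(const char *path) { std::printf("AVX path: %s\n", path); }

int main(int argc, char **argv) {
  if (argc < 2) return 1;
  __builtin_cpu_init();                 // GCC builtin: query CPU feature flags
  if (__builtin_cpu_supports("avx"))    // AVX available (e.g. on Sandy Bridge)
    parse_with_avx(argv[1]);
  else
    parse_with_sse(argv[1]);            // fall back to the 128-bit SSE path
  return 0;
}
\end{verbatim}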
    8766
    88 Table \ref{XMLDocChars} shows the document characteristics of the XML
    89 input files selected for this performance study.  The jawiki.xml and
    90 dewiki.xml XML files represent document-oriented XML inputs and
    91 contain the three-byte and four-byte UTF-8 sequence required for the
    92 UTF-8 encoding of Japanese and German characters respectively.  The
    93 remaining data files are data-oriented XML documents and consist
    94 entirely of single byte $7$-bit encoded ASCII characters.
      67We investigate the execution profile of each XML parser using the
      68Performance Monitoring Counter (PMC) hardware events found in the
      69processor. We have chosen several key hardware performance events
      70that provide insight into the profile of our application and
      71indicate whether the processor is doing useful
      72work~\cite{bellosa2001, bertran2010}.  The set of performance
      73counters included in our study comprises branch instructions, branch
      74mispredictions, integer instructions, SIMD instructions, and cache
      75misses. In addition, we characterize the type and class of the SIMD
      76operations executed using the Intel Pin binary instrumentation
      77framework.
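These counters can be collected with any PMC interface; a minimal sketch using PAPI preset events as stand-ins for the raw counters (preset availability varies by machine, and this is not our actual measurement harness):
\begin{verbatim}
#include <papi.h>
#include <cstdio>

// Count a subset of the hardware events listed above around one run of
// the region of interest (here a placeholder for a parsing pass).
int main() {
  int events[] = { PAPI_TOT_INS,   // total instructions retired
                   PAPI_BR_INS,    // branch instructions
                   PAPI_BR_MSP,    // mispredicted branches
                   PAPI_VEC_INS,   // vector/SIMD instructions
                   PAPI_L2_DCM };  // L2 data cache misses
  long long counts[5] = {0};
  PAPI_library_init(PAPI_VER_CURRENT);
  int set = PAPI_NULL;
  PAPI_create_eventset(&set);
  PAPI_add_events(set, events, 5);
  PAPI_start(set);
  /* ... run one XML well-formedness pass here ... */
  PAPI_stop(set, counts);
  for (int i = 0; i < 5; ++i)
    std::printf("event %d: %lld\n", i, counts[i]);
  return 0;
}
\end{verbatim}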
    9578
    9679
    97 \subsection{Platform Hardware}
    98 \paragraph{Intel \CO{}}
    99 Intel \CO{} processor, code name Conroe, produced by
    100 Intel. Table \ref{core2info} gives the hardware description of the
    101 Intel \CO{} machine.
     80
     81
    10282
    10383\begin{table*}[h]
     
    11494\end{tabular}
    11595\caption{Platform Hardware Specs}
     96\label{hwinfo}
    11697\end{table*}
    11798
    118 Intel \CITHREE\ processor, code name Nehalem, produced by Intel. The
    119 intent of the selection of this processor is to serve as an example of a low end server
    120 processor. Table \ref{i3info} gives the hardware description of the
    121 Intel \CITHREE\ machine. Intel \CIFIVE\  processor, code name \SB\, produced by
    122 Intel. Table \ref{sandybridgeinfo} gives the hardware description of the
    123 Intel \CITHREE\ machine.
    124 Each of the hardware events selected relates to performance and energy
    125 features associated with one or more hardware units.  For example,
    126 total branch mispredictions relate to the branch predictor and branch
    127 target buffer capacity.
    12899
    129 The set of PMC events used included in this study are as follows.
    130 Processor Cycles, Branch Instructions, Branch Mispredictions, Integer
    131 Instructions, SIMD Instructions and Cache Misses.
    132100
    133 \subsection{Energy Measurement}
    134   We measure energy consumption using the Fluke i410 current
    135 clamp applied on the 12V wires that supply power to the processor
    136 sockets. The clamp detects the magnetic field created by the flowing
    137 current and converts it into voltage levels (1mV per 1A
    138 current). The voltage levels are then monitored by an Agilent 34410a
    139 multimeter at the granularity of 100 samples per second. This
    140 measurement captures the power to the processor package, including
    141 cores, caches, Northbridge memory controller, and the quick-path
    142 interconnects \cite{clamp}.
     101\paragraph{Energy Measurement}
     102
     103A key benefit of the Parabix parser is its more efficient use of the
      104processor pipeline, which is reflected in its overall energy usage.  We
     105measure the energy consumption of the processor directly using a
     106current clamp. We apply the Fluke i410 current clamp \cite{clamp} to the 12V wires
     107that supply power to the processor sockets. The clamp detects the
     108magnetic field created by the flowing current and converts it into
     109voltage levels (1mV per 1A current). The voltage levels are then
     110monitored by an Agilent 34410a digital multimeter at the granularity
     111of 100 samples per second. This measurement captures the instantaneous
     112power to the processor package, including cores, caches, northbridge
     113memory controller, and the quick-path interconnects. We obtain samples
      114throughout the entire execution of the program and then calculate the
      115total energy as $12V \times \sum_{i=1}^{N_{samples}} Sample_i$.
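The post-processing step then reduces to a simple summation over the multimeter log. A minimal sketch, assuming the log has been converted to one current sample in amperes per line (1mV recorded per 1A of current), and including the 10ms sample interval so that the result comes out in joules:
\begin{verbatim}
#include <cstdio>

// Integrate the sampled current: power per sample is 12 V * I, and each
// sample covers 10 ms at the meter's 100 samples-per-second rate, so the
// total energy in joules is 12 * sum(I_i) * 0.01.
int main() {
  const double volts = 12.0, dt = 0.01;  // supply voltage, sample interval (s)
  double amps, joules = 0.0;
  while (std::scanf("%lf", &amps) == 1)
    joules += volts * amps * dt;
  std::printf("total energy: %.3f J\n", joules);
  return 0;
}
\end{verbatim}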
     116
     117
  • docs/HPCA2012/05-corei3.tex

    r1341 r1365  
    1 \section{Baseline Evaluation on \CITHREE{}}
      1\section{Efficiency of Parabix}
    22\label{section:baseline}
     3
     4%[AS] Need to say what we are doing here.
     5
      6In brief, for each of the XML parsers under study we measure and
      7evaluate the energy consumption required to carry out XML
      8well-formedness checking, under a variety of workloads, and as
      9executed on three different Intel processors.
     10
     11
     12
     13
    314%some of the numbers are roughly calculated, needs to be recalculated for final version
    415\subsection{Cache behavior}
    5 \CITHREE\ has a three level cache hierarchy.  The approximate miss
      16\CITHREE\ has a three-level cache hierarchy.  The approximate miss
    617penalty for each cache level is 4, 11, and 36 cycles respectively.
    718Figure \ref{corei3_L1DM}, Figure \ref{corei3_L2DM} and Figure
  • docs/HPCA2012/07-avx.tex

    r1361 r1365  
    1 \section{Scaling Parabix2 for AVX}
     1\section{Scaling Parabix for AVX}
    22\label{section:avx}
    3 In this section, we discuss the scalability and performance advantages of our 256-bit AVX (Advanced Vector Extensions) Parabix2 port.
    4 Parabix2 originally targetted the 128-bit SSE2 SIMD technology available on all modern 64-bit Intel and AMD processors but
    5 has recently been ported to AVX. AVX technology is commercially available on the
    6 latest the \SB\ microarchitecture Intel processors.
      3In this section, we discuss the scalability and performance advantages
      4of our 256-bit AVX (Advanced Vector Extensions) port of Parabix.  The
      5Parabix SIMD libraries originally targeted the 128-bit SSE2 SIMD
      6technology available on all modern 64-bit Intel and AMD processors but
      7have recently been ported to AVX. AVX technology is commercially
      8available on the latest Intel processors based on the \SB\
      9microarchitecture. While the runtime framework had to be ported, the
      10application itself did not need to be modified.
    711
    812\begin{figure*}
     
    1014\includegraphics[height=0.25\textheight]{plots/InsMix.pdf}
    1115\end{center}
    12 \caption{Parabix2 Instruction Counts (y-axis: Instructions per kB)}
     16\caption{Parabix Instruction Counts (y-axis: Instructions per kB)}
    1317\label{insmix}
    1418\end{figure*}
     
    1822\includegraphics[width=0.5\textwidth]{plots/avx.pdf}
    1923\end{center}
    20 \caption{Parabix2 Performance (y-axis: ns per kB)}
     24\caption{Parabix Performance (y-axis: ns per kB)}
    2125\label{avx}
    2226\end{figure}
    2327
    24 \subsection{Three Operand Form}
      28\paragraph{3-Operand Form}
      29In addition to the widening of 128-bit operations to 256-bit
      30operations, AVX technology uses a nondestructive 3-operand instruction
      31format. Previous SSE implementations used a destructive 2-operand
      32instruction format. In the 2-operand format a single register is used
      33as both a source and a destination register; for example, $a =
      34a~\texttt{[op]}~b$.  As such, 2-operand instructions that require the
      35values of both $a$ and $b$ must either copy an additional register
      36value beforehand, or reconstitute or reload a register value
      37afterwards to recover it.  With the 3-operand format, output
      38may be directed to a third register independently of the source
      39operands; for example, $c = a~\texttt{[op]}~b$.  By avoiding the
      40copying or reconstituting of operand values, a considerable reduction
      41in the instructions required for unloading from and loading into
      42registers is possible.  AVX technology makes the 3-operand form
      43available for both the new 256-bit operations and the base 128-bit SSE operations.
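As a concrete illustration (the assembly in the comments is typical compiler output, paraphrased rather than taken from the Parabix build), the same bitwise intrinsic compiles to a destructive 2-operand instruction under SSE and to a nondestructive 3-operand instruction under AVX:
\begin{verbatim}
#include <emmintrin.h>   // SSE2 intrinsics

// c = a AND b, with a and b both still needed afterwards.
__m128i and_keep_operands(__m128i a, __m128i b) {
  __m128i c = _mm_and_si128(a, b);
  // SSE (2-operand, destructive):    movdqa c, a ; pand c, b
  //   -- an extra register copy is needed because pand overwrites
  //      its first operand.
  // AVX (3-operand, nondestructive): vpand c, a, b
  //   -- no copy; the result is written directly to a third register.
  return c;
}
\end{verbatim}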
    2544
    26 In addition to the widening of 128-bit operations to 256-bit operations, AVX technology
    27 uses a nondestructive 3-operand instruction format. Previous SSE implementations
    28 used a destructive 2-operand instruction format. In the 2-operand format
    29 a single register is used as both a source and
    30 destination register. For example, $a = a~\texttt{[op]}~b$.
    31 As such, 2-operand instructions that require the value of both $a$ and $b$,
    32 must either copy an additional register value beforehand, or reconstitute or reload a register value
    33 afterwards to recover the value.
    34 With the 3-operand format, output may now be directed to the third register independently
    35 of the source operands. For example, $c = a~\texttt{[op]}~b$.
    36 By avoiding the copying or reconstituting of operand values, a considerable
    37 reduction in instruction count in the form of reduced load and store instructions is possible.
    38 AVX technology makes available the 3-operand form for both the new 256-bit
    39 operations as well as the base 128-bit SSE operations.
    40 
    41 \subsection{256-bit AVX Operations}
    42 
     45\subsection{256-bit Operations}
    4346With the introduction of 256-bit SIMD registers, and under ideal
    4447conditions, one would anticipate a corresponding 50\% reduction in the
    45 SIMD instruction count of Parabix2 on AVX.  However, in the \SB\ AVX
     48SIMD instruction count of Parabix on AVX.  However, in the \SB\ AVX
    4649implementation, Intel has focused primarily on floating point
    4750operations as opposed to the integer based operations.  256-bit SIMD
    4851is available for loads, stores, bitwise logic and floating operations,
    4952whereas SIMD integer operations and shifts are only available in the
    50 128-bit form.  Nevertheless, with loads, stores and bitwise logic
    51 comprising a major portion of the Parabix2 SIMD instruction mix, a
    52 substantial reduction in instruction count and consequent performance
    53 improvement was anticipated but not achieved.
     53128-bit form.
     54
    5455
    5556\subsection{Performance Results}
    5657
    57 We implemented two versions of Parabix2 using AVX technology.  The first
    58 was simply the recompilation of the existing Parabix2 source code
    59 written to take advantage of the 3-operand form of AVX instructions while retaining
    60 a uniform 128-bit SIMD processing width.  The second involved rewriting the
    61 core library functions of Parabix2 to leverage the 256-bit AVX operations wherever
    62 possible and to simulate the remaining operations using pairs of 128-bit
    63 operations.   
    64 
    65 Figure \ref{insmix} shows the reduction in instruction
    66 counts achieved in these two versions.   For each workload, the
    67 base instruction count of the Parabix2 binary compiled in SSE-only
    68 mode is shown with the caption ``sse,'' the version obtained by
    69 simple recompilation with AVX-mode enabled is labeled ``128-bit avx,''
    70 and the version reimplemented to use 256-bit operations wherever
    71 possible is labelled ``256-bit avx.''    The instruction counts
    72 are divided into three classes.  The ``non-SIMD'' operations
    73 are the general purpose instructions that use neither SSE nor
    74 AVX technology.   The ``bitwise SIMD'' class comprises
    75 the bitwise logic operations, that are available in both 128-bit
    76 form and 256-bit form.  The ``other SIMD'' class comprises
    77 all other SIMD operations, primarily comprising the integer SIMD
    78 operations that are available only at 128-bit widths even with
    79 256-bit AVX technology.
      58We implemented two versions of Parabix using AVX technology.  The
      59first was simply a recompilation of the existing Parabix source code
      60to take advantage of the 3-operand form of AVX instructions while
      61retaining a uniform 128-bit SIMD processing width.  The second
      62involved rewriting the internal library functions of Parabix to
      63leverage the 256-bit AVX operations wherever possible and to simulate
      64the remaining operations using pairs of 128-bit operations.  Figure
      65\ref{insmix} shows the reduction in instruction counts achieved in
      66these two versions.  For each workload, the base instruction count of
      67the Parabix binary compiled in SSE-only mode is indicated by ``sse,''
      68the version that only takes advantage of the AVX 3-operand form is
      69labelled ``128-bit avx,'' and the version reimplemented to use 256-bit
      70operations wherever possible is labelled ``256-bit avx.''  The
      71instruction counts are divided into three classes: ``non-SIMD''
      72operations are the general-purpose instructions that use neither SSE
      73nor AVX technology.  The ``bitwise SIMD'' class comprises the bitwise
      74logic operations, which are available in both 128-bit and 256-bit
      75form.  The ``other SIMD'' class comprises all other SIMD operations,
      76primarily the integer SIMD operations that are available only at
      77128-bit widths even under AVX.
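As an example of the simulation mentioned above (a simplified sketch, not the actual Parabix library code), an integer operation with no 256-bit form on \SB{}, such as a per-field 64-bit shift, can be composed from a pair of 128-bit operations:
\begin{verbatim}
#include <immintrin.h>   // AVX intrinsics

// Shift each 64-bit field of a 256-bit value left by `bits`, using two
// 128-bit integer shifts, since pre-AVX2 hardware has no 256-bit
// integer shift instruction.
static inline __m256i slli_epi64_256(__m256i v, int bits) {
  __m128i lo = _mm256_castsi256_si128(v);        // low 128 bits
  __m128i hi = _mm256_extractf128_si256(v, 1);   // high 128 bits
  lo = _mm_slli_epi64(lo, bits);
  hi = _mm_slli_epi64(hi, bits);
  return _mm256_insertf128_si256(_mm256_castsi128_si256(lo), hi, 1);
}
\end{verbatim}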
    8078
    8179Note that, in each workload, the number of non-SIMD instructions
    82 remains relatively constant with each workload.  As may be expected,
    83 however, the number of ``bitwise SIMD'' operations remains the same
      80remains relatively constant.  As may be expected,
      81the number of \textit{bit-parallel SIMD} operations remains the same
    8482for both SSE and 128-bit while dropping dramatically when operating
    85 256-bits at a time.  Ideally one one may expect up to a 50\% reduction
    86 in these instructions versus the 128-bit AVX.  The actual reduction
    87 measured was 32\%--39\% depending on workload.  Because some bitwise
    88 logic is needed in implementation of simulated 256-bit operations, the
    89 full 50\% reduction in bitwise logic was not achieved.
      83256 bits at a time.  The measured reduction was 32\%--39\% depending
      84on the workload, because some of the bitwise logic needed to simulate
      85the missing 256-bit operations is still composed of 128-bit
      86operations. This limits the performance gains achieved when using the
      87AVX instructions.  The ``other SIMD'' class shows a substantial
      8830\%--35\% reduction with AVX 128-bit technology compared to SSE.
      89This reduction is due to the elimination of register unloading and
      90reloading when SIMD operations are compiled using the 3-operand AVX
      91form rather than the 2-operand SSE form.  A further 10\%--20\%
      92reduction is observed with the Parabix version rewritten to use 256-bit operations.
    9093
    91 The ``other SIMD'' class shows a substantial 30\%-35\% reduction
    92 with AVX 128-bit technology compared to SSE.  This reduction is
    93 due to eliminated copies or reloads when SIMD operations
    94 are compiled using 3-operand AVX form versus 2-operand SSE form.
    95 A further 10\%--20\% reduction is observed with Parabix2 version
    96 rewritten to use 256-bit operations. 
    97 
    98 While the successive reductions in SIMD instruction counts are quite
    99 dramatic with the two AVX implementations of Parabix2, the performance
    100 benefits are another story.  As shown in Figure \ref{avx}, the
    101 benefits of the reduced SIMD instruction count are achieved only in
    102 the AVX 128-bit version.  In this case, the benefits of 3-operand form
    103 seem to fully translate to performance benefits.  Based on the
    104 reduction of overall Bitwise-SIMD instructions we expected a 11\%
    105 improvement in performance.  Instead, perhaps bizzarely, the
    106 performance of Parabix2 in the 256-bit AVX implementation does not
    107 improve significantly and actually degrades for files with higher
    108 markup density (average 10\%). Dewiki.xml, on which bitwise-SIMD
    109 instructions reduced by 39\%, saw a performance improvement of 8\%.
    110 We believe that this is primarily due to the intricacies of the first
    111 generation AVX implemention in \SB{}, with significant latency in many
    112 of the 256-bit instructions in comparison to their 128-bit
    113 counterparts. The 256-bit instructions also have different scheduling
    114 constraints that seem to reduce overall SIMD throughput.  If these
    115 latency issues can be addressed in future AVX implementations, further
    116 substantial performance and energy benefits could be realized in XML
    117 parsing with Parabix2.
     94%[AS] Check numbers.
      95The reductions in instruction counts are quite dramatic with the AVX
      96extensions in Parabix, demonstrating the ability of our runtime
      97framework to exploit the available hardware resources. As shown in
      98Figure \ref{avx}, the benefits of the reduced SIMD instruction count
      99are achieved only in the AVX 128-bit version.  In this case, the
      100benefits of the 3-operand form seem to translate fully into performance
      101benefits.  Based on the reduction in overall bitwise-SIMD instructions
      102we expected an 11\% improvement in performance.  Instead, perhaps
      103bizarrely, the performance of Parabix in the 256-bit AVX
      104implementation does not improve significantly and actually degrades
      105for files with higher markup density (by 11\% on average). Dewiki.xml, on
      106which bitwise-SIMD instructions were reduced by 39\%, saw a performance
      107improvement of 8\%.  We believe that this is primarily due to the
      108intricacies of the first-generation AVX implementation in \SB{}, with
      109significant latency in many of the 256-bit instructions in comparison
      110to their 128-bit counterparts. The 256-bit instructions also have
      111different scheduling constraints that seem to reduce overall
      112throughput.  If these latency issues can be addressed in future AVX
      113implementations, further substantial performance and energy benefits
      114could be realized in XML parsing with Parabix.