Changeset 1001 for docs


Timestamp:
Mar 25, 2011, 4:00:24 PM (9 years ago)
Author:
lindanl
Message:

section 4 and 5

Location:
docs/PACT2011
Files:
3 edited

  • docs/PACT2011/04-methodology.tex

    r996 r1001  
    22
    33
    4 In this section, we describe our methodology for the measurements and investigation of XML parsing energy consumption and performance.
    5 In brief, for each of the XML parsers under study we propose to measure and evaluate the energy consumption required to carry out XML well-formedness checking,
    6 under a variety of workloads, and as executed on three different Intel cores.
     4In this section, we describe our methodology for the measurements and
     5investigation of XML parsing energy consumption and performance.  In
     6brief, for each of the XML parsers under study we propose to measure
     7and evaluate the energy consumption required to carry out XML
     8well-formedness checking, under a variety of workloads, and as
     9executed on three different Intel cores.
    710
    8 To begin our study, we propose to first investigate each of the XML parsers in terms of the PMCs hardware events as listed in the following subsection.
    9 Based on previous key works \cite{bellosa2001, bertran2010, bircher2007},
    10 we have chosen several key hardware performance events for which the authors indicate have a strong correlation to energy consumption.
    11 From these data, we hope to gain insight into the XML parser execution characteristics which most significantly contribute to overall energy consumption.
    12 Secondly, using the Fluke i410 current clamp meter, we plan to measure the total energy consumption required to complete XML well-formedness checking for each XML parser,
    13 on each hardware platform, and for each of a number of XML source files.
     11To begin our study, we propose to first investigate each of the XML
     12parsers in terms of the PMC hardware events listed in the
     13following subsection. Based on the recommendation of previous
     14proposals \cite{bellosa2001, bertran2010, bircher2007}, we have chosen
     15several key hardware performance events that the authors indicate
     16have a strong correlation to energy consumption.  We also measure
     17other runtime counts such as the number of SIMD instructions and
     18bitwise operations using the Pin binary instrumentation
     19framework. From these data, we hope to gain insight into the XML
     20parser execution characteristics and to compare and contrast different
     21industrial parsers.
    1422
    15 The foundational work by Bellosa in \cite{bellosa2001} as well as more recent work in \cite {bircher2007, bertran2010}
    16 show that hardware-usage patterns has a significant impact in the energy consumption of a particular application;
    17 \cite{bellosa2001, bircher2007, bertran2010} further show that there is a strong correlation between
    18 specific performance events and energy usage---but the authors of each differ slightly in opinion as to
    19 which performance monitoring counters\footnote{Performance monitoring counters (PMCs) are special-purpose registers that are included in most modern microprocessors;
    20 they store the running count of specific hardware events, such as retired instructions, cache misses, branch mispredictions, and arithmetic-logic unit operations to name a few.
    21 They can be used to capture information about any program at run-time, under any workload, at a very fine granularity.} (PMCs) to use.
     23The foundational work by Bellosa in \cite{bellosa2001} as well as more
     24recent work in \cite{bircher2007, bertran2010} show that
     25hardware-usage patterns have a significant impact on the energy
     26consumption of a particular application; \cite{bellosa2001,
     27  bircher2007, bertran2010} further show that there is a strong
     28correlation between specific performance events and energy usage---but
     29the authors of each differ slightly in opinion as to which performance
     30monitoring counters\footnote{Performance monitoring counters (PMCs)
     31  are special-purpose registers that are included in most modern
     32  microprocessors; they store the running count of specific hardware
     33  events, such as retired instructions, cache misses, branch
     34  mispredictions, and arithmetic-logic unit operations to name a few.
     35  They can be used to capture information about any program at
     36  run-time, under any workload, at a very fine granularity.} (PMCs) to
     37use.
    2238
    2339
    24 The following subsections describe the XML parsers under study, XML workloads, the hardware architectures, PMC hardware events selected for measurement, and the Fluke i401 current clamp meter.
    25 The expected outcomes of this section are hardware performance counter measurements and total energy consumption measurements for each of XML parser, XML source file, and hardware combination.
     40The following subsections describe the XML parsers under study, XML
     41workloads, the hardware architectures, PMC hardware events selected
     42for measurement, and the energy measurement setup. We analyze the
     43performance of the different parsers based on the hardware performance
     44counter measurements and contrast their energy consumption
     45based on direct measurement.
     46
    2647
    2748\subsection{Parsers}\label{parsers}
    2849
    29 The XML parsing technologies selected for this study are the Parabix2, Xerces-C++, and Expat XML parsers.
    30 Parabix2 \cite{parabix2} (parallel bit streams for XML) is the second generation Parabix parser. Parabix2 is an open-source XML parser that leverages the SIMD capabilities of modern commodity processors;
    31 it employs the new parallelization techniques using parallel parsing with bit stream addition to deliver dramatic performance improvements over traditional byte-at-a-time parsing technology.
    32 Xerces-C++ version 3.1.1 (SAX) \cite{xerces} is a validating open source XML parser written in C++ by the Apache project.
    33 Expat version 2.0.1 \cite{expat} is a non-validating XML parser library written in C.
     50The XML parsing technologies selected for this study are the Parabix2,
     51Xerces-C++, and Expat XML parsers.  Parabix2 \cite{parabix2} (parallel
     52bit streams for XML) is the second generation Parabix parser. Parabix2
     53is an open-source XML parser that leverages the SIMD capabilities of
     54modern commodity processors; it employs the new parallelization
     55techniques using parallel parsing with bit stream addition to deliver
     56dramatic performance improvements over traditional byte-at-a-time
     57parsing technology.  Xerces-C++ version 3.1.1 (SAX) \cite{xerces} is a
     58validating open source XML parser written in C++ by the Apache
     59project.  Expat version 2.0.1 \cite{expat} is a non-validating XML
     60parser library written in C.
    3461
    3562\begin{table*}
     
    5077\subsection{Workloads}\label{workloads}
    5178
    52 Distinguishing between ``document-oriented'' XML and ``data-oriented'' XML is a popular way to describe the two basic classes of XML documents.
    53 Data-oriented XML is used as an interchange format.
    54 Document-oriented XML is used to impose structure on information that rarely fits neatly into a relational database--particularly information intended for publishing.
    55 Data-oriented XML are characterized by a higher markup density.
    56 Markup density is defined as the ratio of the total markup contained within an XML file to the total XML document size.
    57 This metric may have substantial influence on the performance of XML parsing.
    58 As such we choose workloads with a spectrum of markup densities.
     79Distinguishing between ``document-oriented'' XML and ``data-oriented''
     80XML is a popular way to describe the two basic classes of XML
     81documents.  Data-oriented XML is used as an interchange format.
     82Document-oriented XML is used to impose structure on information that
     83rarely fits neatly into a relational database, particularly
     84information intended for publishing.  Data-oriented XML is
     85characterized by a higher markup density.  Markup density is defined
     86as the ratio of the total markup contained within an XML file to the
     87total XML document size.  This metric may have substantial influence
     88on the performance of XML parsing.  As such we choose workloads with a
     89spectrum of markup densities.
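As a sketch of this definition, assuming both quantities are measured
in bytes (the symbols $B_{\mathrm{markup}}$ and $B_{\mathrm{total}}$
are notation introduced here only for illustration), markup density
$d$ can be written as
\begin{equation}
d = \frac{B_{\mathrm{markup}}}{B_{\mathrm{total}}}, \qquad 0 \le d \le 1.
\end{equation}
For instance, a hypothetical file of 1000 bytes containing 250 bytes
of markup would have $d = 0.25$.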
    5990
    60 Table \ref{XMLDocChars} shows the document characteristics of the XML input files selected for this performance study.
    61 The jawiki.xml and dewiki.xml XML files represent document-oriented XML inputs, containing three-byte and four-byte UTF8 sequence.
    62 The remaining files are data-oriented inputs and consist of only ASCII characters.\cite{CameronHerdyLin2008}
     91Table \ref{XMLDocChars} shows the document characteristics of the XML
     92input files selected for this performance study.  The jawiki.xml and
     93dewiki.xml XML files represent document-oriented XML inputs,
     94containing three-byte and four-byte UTF-8 sequences.  The remaining
     95files are data-oriented inputs and consist of only ASCII
     96characters~\cite{CameronHerdyLin2008}.
    6397
    64 Describe parameters; what each parameter means.
     98
    6599\subsection{Platform Hardware}
    66 \subsubsection{Intel Core 2}
     100\paragraph{Intel Core 2}
    67101\begin{table}[h]
    68102\begin{center}
     
    81115\label{core2info}
    82116\end{table}
    83 \subsubsection{Intel Core i3}
    84 The Intel Core i3 is a Nehalem based processor produced by Intel. The intent of this processor is to serve as a
    85 low end server processor. Table \ref{i3info} gives the hardware description of the Intel Core i3 based machine selected.
     117
     118\paragraph{Intel Core i3}
     119The Intel Core i3 is a Nehalem-based processor produced by Intel. The
     120intent of this processor is to serve as an example of a low-end server
     121processor. Table \ref{i3info} gives the hardware description of the
     122Intel Core i3 based machine selected.
    86123
    87124\begin{table}[h]
     
    104141\end{table}
    105142
    106 \subsubsection{Sandy Bridge}
     143\paragraph{Sandy Bridge}
    107144
    108145\begin{table}[h]
     
    127164\subsection{PMC Hardware Events}\label{events}
    128165
    129 Each of the hardware events selected relates to the energy consumption due to one or more hardware units. For example, total branch miss predictions corresponds to the use of the branch misprediction unit.
     166Each of the hardware events selected relates to the energy consumption
     167due to one or more hardware units. For example, total branch
     168mispredictions correspond to the use of the branch misprediction unit.
    130169
    131170Initial PMC hardware event set:
     
    139178\end{itemize}
    140179
    141 \subsection{Measurement Hardware}
    142 The Fluke i410 current clamp meter is an electrical tester that combines a voltmeter with a clamp type current meter.
    143 Like the multimeter, the clamp meter has transitioned through the analog period and into the digital era. Created primarily as a single purpose test tool for electricians,
    144 the Fluke i410 have incorporated more measurement functions and accuracy \cite{clamp}.
     180\subsection{Energy Measurement}
     181  To measure energy we use a Fluke i410 current
     182clamp applied on the 12V wires that supply power to the processor
     183sockets. The clamp detects the magnetic field created by the flowing
     184current and converts it into voltage levels (1mV per 1A
     185current). The voltage levels are then monitored by an Agilent 34410A
     186multimeter at a granularity of 100 samples per second. This
     187measurement captures the power to the processor package, including
     188cores, caches, Northbridge memory controller, and the quick-path
     189interconnects~\cite{clamp}.
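A minimal sketch of how energy could be derived from these samples,
assuming a constant nominal supply of 12V and a simple rectangle-rule
sum (the symbols $v_i$, $I_i$, $P_i$, $N$ and $\Delta t$ are notation
introduced here for illustration, not a description of the actual
post-processing):
\begin{eqnarray}
I_i & = & \frac{v_i}{1\,\mathrm{mV/A}}, \\
P_i & = & 12\,\mathrm{V} \times I_i, \\
E & \approx & \sum_{i=1}^{N} P_i \, \Delta t, \qquad \Delta t = \frac{1}{100}\,\mathrm{s} = 10\,\mathrm{ms},
\end{eqnarray}
where $v_i$ is the $i$-th clamp voltage reading in mV and $N$ is the
number of samples taken while the parser runs.  For example, a
reading of 2mV would correspond to 2A and therefore to about 24W
under these assumptions.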
  • docs/PACT2011/05-corei3.tex

    r980 r1001  
    1 \section{Evaluation on Corei3}
     1\section{Baseline Evaluation on Core i3}
    22
    33%some of the numbers are roughly calculated, needs to be recalculated for final version
    44\subsection{Cache behavior}
    5 Core i3 has a three level cache hierarchy.
    6 The miss penalty for each level is about 4 cycles, 11 cycles, and 36 cycles.
    7 Figure \ref{corei3_L1DM}, Figure \ref{corei3_L2DM} and Figure \ref{corei3_L3TM} show the L1, L2 and L3 data cache misses of all the four parsers.
    8 Although XML parsing is not a memory intensive application,
    9 the cost of cache miss for Expat and Xerces can be about half cycle per byte while the performance of Parabix is hardly affected by cache misses.
    10 Cache miss isn't just a problem for performance but also energy consumption.
    11 L1 cache miss cost about 8.3nJ; L2 cache miss cost about 19nJ; L3 cache miss cost about 40nJ.
    12 With a 1GB input file, Expat would consume more than 0.6J and Xerces would consume 0.9J on cache miss.
     5Core i3 has a three-level cache hierarchy.  The miss penalty for each
     6level is about 4 cycles, 11 cycles, and 36 cycles, respectively.  Figure
     7\ref{corei3_L1DM}, Figure \ref{corei3_L2DM} and Figure
     8\ref{corei3_L3TM} show the L1, L2 and L3 data cache misses of all
     9four parsers.  Although XML parsing is not a memory-intensive
     10application, the cost of cache misses for Expat and Xerces can be about
     11half a cycle per byte, while the performance of Parabix is hardly
     12affected by cache misses.  Cache misses are a problem not only for
     13performance but also for energy consumption.  An L1 cache miss costs about
     148.3nJ, an L2 cache miss about 19nJ, and an L3 cache miss about 40nJ.
     15With a 1GB input file, Expat would consume more than 0.6J and Xerces
     16would consume 0.9J on cache misses alone.
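As a rough sketch of the arithmetic behind such an estimate, using the
per-miss costs quoted above and writing $N_{L1}$, $N_{L2}$ and
$N_{L3}$ for the miss counts at each level (hypothetical symbols
introduced here for illustration):
\begin{equation}
E_{\mathrm{miss}} \approx N_{L1} \times 8.3\,\mathrm{nJ} + N_{L2} \times 19\,\mathrm{nJ} + N_{L3} \times 40\,\mathrm{nJ}.
\end{equation}
Taking 1GB as $10^9$ bytes (an assumption), 0.6J for Expat corresponds
to about $0.6\,\mathrm{J} / 10^{9}\,\mathrm{bytes} = 0.6$nJ per byte
spent on cache misses, and 0.9J for Xerces to about 0.9nJ per byte.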
    1317
    1418
     
    3842
    3943\subsection{Branch Mispredictions}
    40 Despite years of improvement, branch misprediction is still a significant bottleneck of performance.
    41 The penalty of a branch misprediction is generally more than 10 CPU cycles.
    42 As shown in Figure \ref{corei3_BM}, the cost of branch mispredictions for Expat can be more than 7 cycles per byte,
    43 which is as much as the processing time of Parabix2 on the same workload.
     44Despite years of improvement, branch misprediction is still a
     45significant performance bottleneck.  The penalty of a branch
     46misprediction is generally more than 10 CPU cycles.  As shown in
     47Figure \ref{corei3_BM}, the cost of branch mispredictions for Expat
     48can be more than 7 cycles per byte, which is as much as the processing
     49time of Parabix2 on the same workload.
    4450
    45 Reducing the branch misprediction rate is difficult for text-based applications due to the variable-length nature of syntactic elements.
    46 Therefore, the alternative solution of reducing branches becomes more attractive.
    47 However, the traditional byte-at-a-time method of XML parsing usually involves large amount of inevitable branches.
    48 As shown in Figure \ref{corei3_BR}, Xerces can have an average of 13 branches for each byte it processed on the high markup density file.
    49 Parabix substantially eliminate the branches by using parallel bit streams.
    50 Parabix1 still have a few branches for each block of 128 bytes (SSE) due to the sequential scanning.
    51 But with the new parallel scanning technique, Parabix2 is essentially branch-free as shown in the Figure \ref{corei3_BR}.
    52 As a result, Parabix2 has much less dependencies on markup density of the workloads.
     51Reducing the branch misprediction rate is difficult for text-based
     52applications due to the variable-length nature of syntactic elements.
     53Therefore, the alternative solution of reducing branches becomes more
     54attractive.  However, the traditional byte-at-a-time method of XML
     55parsing usually involves a large number of unavoidable branches.  As
     56shown in Figure \ref{corei3_BR}, Xerces can have an average of 13
     57branches for each byte it processes on the high markup density file.
     58Parabix substantially eliminates branches by using parallel bit
     59streams.  Parabix1 still has a few branches for each block of 128
     60bytes (SSE) due to sequential scanning.  But with the new parallel
     61scanning technique, Parabix2 is essentially branch-free as shown in
     62Figure \ref{corei3_BR}.  As a result, Parabix2 has minimal
     63dependency on the markup density of the workloads.
    5364
    5465\begin{figure}
     
    7081\subsection{SIMD/Total Instructions}
    7182
    72 Parabix gains its performance by using parallel bitstreams, which are mostly generated and calculated by SIMD instructions.
    73 The ratio of executed SIMD instructions over total instructions indicates the amount of parallel processing we were able to achieve.
    74 We use Intel pin, a dynamic binary instrumentation tool, to gather instruction mix.
     83Parabix gains its performance by using parallel bitstreams, which are
     84mostly generated and calculated by SIMD instructions.  The ratio of
     85executed SIMD instructions over total instructions indicates the
     86amount of parallel processing we were able to achieve.  We use Intel
     87Pin, a dynamic binary instrumentation tool, to gather the instruction mix.
    7588We then add up all the vector instructions that have been executed.
    76 Figure \ref{corei3_INS_p1} and Figure \ref{corei3_INS_p2} show the percentage of SIMD instructions
    77 of Parabix1 and Parabix2 (Expat and Xerce do not use any SIMD instructions).
    78 For Parabix1, 18\% to 40\% of the executed instructions consists of SIMD instructions.
    79 By using bistream addition for parallel scanning, Parabix2 uses 60\% to 80\% SIMD instructions.
    80 Although the ratio decrease as the markup density increase for both Parabix1 and Parabix2,
    81 the decreasing rate of Parabix2 is much lower and thus
    82 the performance degradation caused by increasing markup density is smaller.
     89Figure \ref{corei3_INS_p1} and Figure \ref{corei3_INS_p2} show the
     90percentage of SIMD instructions of Parabix1 and Parabix2 (Expat and
     91Xerces do not use any SIMD instructions).  For Parabix1, 18\% to 40\%
     92of the executed instructions are SIMD instructions.  By using
     93bitstream addition for parallel scanning, Parabix2 uses 60\% to 80\%
     94SIMD instructions.  Although the ratio decreases as the markup density
     95increases for both Parabix1 and Parabix2, the rate of decrease for
     96Parabix2 is much lower and thus the performance degradation caused by
     97increasing markup density is smaller.
    8398
    8499\begin{figure}
     
    100115\subsection{CPU Cycles}
    101116
    102 Figure \ref{corei3_TOT} shows the result of the overall performance evaluated as CPU cycles per thousands input bytes.
    103 Parabix1 is 1.5 to 2.5 times faster on document-oriented input and 2 to 3 times faster on data-oriented input compared with Expat and Xerces.
    104 Parabix2 is 2.5 to 4 times faster on document-oriented input and 4.5 to 7 times faster on data-oriented input.
    105 Traditional parsers can be dramatically slowed down by higher markup density while Parabix with parallel processing is less affected.
    106 The comparison is not entirely fair for Xerces that transcodes input into UTF-16, which typically takes several cycles per byte.
    107 However, transcoding using parallel bitstreams can be much faster and
    108 it takes less than a cycle per byte to transcode ASCII files such as road.gml, po.xml and soap.xml \cite{Cameron2008}.
     117Figure \ref{corei3_TOT} shows the overall performance
     118evaluated as CPU cycles per thousand input bytes.  Parabix1 is 1.5 to
     1192.5 times faster on document-oriented input and 2 to 3 times faster on
     120data-oriented input compared with Expat and Xerces.  Parabix2 is 2.5
     121to 4 times faster on document-oriented input and 4.5 to 7 times faster
     122on data-oriented input.  Traditional parsers can be dramatically
     123slowed down by higher markup density while Parabix with parallel
     124processing is less affected.  The comparison is not entirely fair to
     125Xerces, which transcodes input into UTF-16, a step that typically takes
     126several cycles per byte.  However, transcoding using parallel
     127bitstreams can be much faster and it takes less than a cycle per byte
     128to transcode ASCII files such as road.gml, po.xml and soap.xml
     129\cite{Cameron2008}.
    109130
    110131\begin{figure}
     
    118139\subsection{Power and Energy}
    119140There is a growing concern about power consumption and energy efficiency.
    120 Chip producers not only work on improving the performance but also have worked hard to develop power efficient chips.
    121 We studied the power and energy consumption of Parabix in comparison with Expat and Xerces on corei3.
    122 We use a clamp to measure the real current of CPU power supply line and a meter to sample and record the results every 10ms.
     141Chip producers work not only on improving performance but also
     142on developing power-efficient chips.  We studied the
     143power and energy consumption of Parabix in comparison with Expat and
     144Xerces on the Core i3.
    123145 
    124 Figure \ref{corei3_power} shows the average power consumed by the four different parsers.
    125 The average power of corei3-530 is about 21 watts.
    126 This model released by Intel last year has a good reputation for power efficiency.
    127 Parabix2 dominated by SIMD instructions uses only about 5\% higher power than the other parsers.
    128 The power range of SIMD instructions .....
     146Figure \ref{corei3_power} shows the average power consumed by the four
     147different parsers.  The average power of the Core i3-530 is about 21 watts.
     148This model, released by Intel last year, has a good reputation for power
     149efficiency.  Parabix2, which is dominated by SIMD instructions, uses only about
     1505\% more power than the other parsers.
    129151
    130152\begin{figure}
     
    136158\end{figure}
    137159
    138 Figure \ref{corei3_energy} shows the energy consumption of the four different parsers.
    139 Although Parabix2 needs slight higer power, its processing time is much shorter and therefore consumes much less energy.
    140 Parabix2 consumes 50 to 75 nJ per byte while Expat and Xerces consumes 80nJ to 320nJ and 140nJ to 370nJ per byte seperately.
     160The more interesting trend is energy.  Figure \ref{corei3_energy} shows
     161the energy consumption of the four different parsers.  Although
     162Parabix2 needs slightly higher power, its processing time is much shorter
     163and it therefore consumes much less energy.  Parabix2 consumes 50 to 75
     164nJ per byte while Expat and Xerces consume 80nJ to 320nJ and 140nJ to
     165370nJ per byte, respectively.
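The relation between average power and energy per byte can be
sketched as follows, assuming power is roughly constant over a run
(the symbols are notation introduced here for illustration): with
average power $P_{\mathrm{avg}}$, $c$ CPU cycles per byte and clock
frequency $f$,
\begin{equation}
E_{\mathrm{byte}} = \frac{P_{\mathrm{avg}} \times c}{f}.
\end{equation}
Under this model, a parser that draws a factor $p$ more power but
needs a factor $s$ fewer cycles per byte uses roughly $p/s$ of the
energy per byte; for example, $p = 1.05$ and a hypothetical $s = 4$
give about $0.26$.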
    141166
    142167\begin{figure}