# Changeset 1039 for docs/PACT2011

Timestamp:
Mar 25, 2011, 8:27:43 PM
Message:

macros for core2 corei3 sandybridge

Location:
docs/PACT2011
Files:
7 edited

• ## docs/PACT2011/00-abstract.tex

 r1037
  against two widely-used XML parsers, James Clark's Expat and Apache's Xerces-C on three generations of x86 machines, including the new Intel
- Sandybridge.    We show that Parabix2's speedup is 2$\times$--7$\times$
+ \SB{}.    We show that Parabix2's speedup is 2$\times$--7$\times$
  over Expat and Xerces.  In stark contrast to the energy expenditures necessary to realize performance gains through multicore parallelism, we also show
• ## docs/PACT2011/01-intro.tex

 r1025
  for the performance and energy study tackled in the remainder of the paper.   Section 5 presents a
- detailed performance evaluation on a Core i3 processor
+ detailed performance evaluation on a \CI\ processor
  as our primary evaluation platform, addressing a number of microarchitectural issues including cache
  performance gains through three generations of Intel architecture culminating with performance assessment
- on our two week-old Sandy Bridge test machine.
+ on our two week-old \SB\ test machine.
  Section 7 looks specifically at issues in applying the new 256-bit AVX technology to parallel bit stream
• ## docs/PACT2011/04-methodology.tex

 r1034
  \subsection{Platform Hardware}
- \paragraph{Intel Core 2}
- The Intel Core 2 is a Conroe based processor produced by
+ \paragraph{Intel \CO{}}
+ The Intel \CO\ is a Conroe based processor produced by
  Intel. Table \ref{core2info} gives the hardware description of the
- Intel Core 2 machine selected.
+ Intel \CO\ machine selected.
  \begin{table}[h]
  \begin{center}
  \begin{tabular}{|c||c|} \hline
- Processor & Intel Core 2 Duo processor 6400  (2.13GHz) \\ \hline
+ Processor & Intel Core2 Duo processor 6400  (2.13GHz) \\ \hline
  L1 Cache & 32KB I-Cache, 32KB D-Cache \\ \hline
  L2 Cache & 2MB \\ \hline
  \end{table}
- \paragraph {Intel Core i3}
- The Intel Core i3 is a Nehalem based processor produced by Intel. The
+ \paragraph {Intel \CI{}}
+ The Intel \CI\ is a Nehalem based processor produced by Intel. The
  intent of this processor is to serve as an example low end server processor. Table \ref{i3info} gives the hardware description of the
- Intel Core i3 machine selected.
+ Intel \CI\ machine selected.
  \begin{table}[h]
  \begin{tabular}{|c||c|} \hline
- Processor & Intel Clarkdale I3-530 (2.93GHz) \\ \hline
+ Processor & Intel i3-530 (2.93GHz) \\ \hline
  L1 Cache & 32KB I-Cache, 32K D-Cache \\ \hline
  L2 Cache & 256KB \\ \hline
  \end{tabular}
  \end{center}
- \caption{Core i3}
+ \caption{\CI{}}
  \label{i3info}
  \end{table}
  \paragraph{Intel Core i5}
- The Intel Core i5 is a Sandy Bridge based processor produced by
+ The Intel Core i5 is a \SB\ based processor produced by
  Intel. Table \ref{sandybridgeinfo} gives the hardware description of the
- Intel Core i3 machine selected.
+ Intel \CI\ machine selected.
  \begin{table}[h]
  \begin{tabular}{|c||c|} \hline
- Processor & Intel Core I5-2300 (2.80GHz) \\ \hline
+ Processor & Intel Sandybridge i5-2300 (2.80GHz) \\ \hline
  L1 Cache &  192 KB\\ \hline
  L2 Cache &  4 X 256KB \\ \hline
  \end{tabular}
  \end{center}
- \caption{Sandy Bridge}
+ \caption{\SB{}}
  \label{sandybridgeinfo}
  \end{table}
• ## docs/PACT2011/05-corei3.tex

 r1004
  %some of the numbers are roughly calculated, needs to be recalculated for final version
  \subsection{Cache behavior}
- Core i3 has a three level cache hierarchy.  The miss penalty for each
+ \CI\ has a three level cache hierarchy.  The miss penalty for each
  level is about 4 cycles, 11 cycles, and 36 cycles.  Figure \ref{corei3_L1DM}, Figure \ref{corei3_L2DM} and Figure
  \includegraphics[width=0.5\textwidth]{plots/corei3_L1DM.pdf}
  \end{center}
- \caption{L1 Data Cache Misses on Core i3 (y-axis: Cache Misses per KByte)}
+ \caption{L1 Data Cache Misses on \CI\ (y-axis: Cache Misses per KByte)}
  \label{corei3_L1DM}
  \end{figure}
  \includegraphics[width=0.5\textwidth]{plots/corei3_L2DM.pdf}
  \end{center}
- \caption{L2 Data Cache Misses on Core i3 (y-axis: Cache Misses per KByte)}
+ \caption{L2 Data Cache Misses on \CI\ (y-axis: Cache Misses per KByte)}
  \label{corei3_L2DM}
  \end{figure}
  \includegraphics[width=0.5\textwidth]{plots/corei3_L3CM.pdf}
  \end{center}
- \caption{L3 Cache Misses on Core i3 (y-axis: Cache Misses per KByte)}
+ \caption{L3 Cache Misses on \CI\ (y-axis: Cache Misses per KByte)}
  \label{corei3_L3TM}
  \end{figure}
  \includegraphics[width=0.5\textwidth]{plots/corei3_BR.pdf}
  \end{center}
- \caption{Branches on Core i3 (y-axis: Branches per KByte)}
+ \caption{Branches on \CI\ (y-axis: Branches per KByte)}
  \label{corei3_BR}
  \end{figure}
  \includegraphics[width=0.5\textwidth]{plots/corei3_BM.pdf}
  \end{center}
- \caption{Branch Mispredictions on Core i3 (y-axis: Branch Mispredictions per KByte)}
+ \caption{Branch Mispredictions on \CI\ (y-axis: Branch Mispredictions per KByte)}
  \label{corei3_BM}
  \end{figure}
  \includegraphics[width=0.5\textwidth]{plots/corei3_TOT.pdf}
  \end{center}
- \caption{Processing Time on Core i3 (y-axis: Total CPU Cycles per KByte)}
+ \caption{Processing Time on \CI\ (y-axis: Total CPU Cycles per KByte)}
  \label{corei3_TOT}
  \end{figure}
  \includegraphics[width=0.5\textwidth]{plots/corei3_power.pdf}
  \end{center}
- \caption{Average Power on Core i3 (watts)}
+ \caption{Average Power on \CI\ (watts)}
  \label{corei3_power}
  \end{figure}
  \includegraphics[width=0.5\textwidth]{plots/corei3_energy.pdf}
  \end{center}
- \caption{Energy Consumption on Core i3 ($\mu$J per KByte)}
+ \caption{Energy Consumption on \CI\ ($\mu$J per KByte)}
  \label{corei3_energy}
  \end{figure}
• ## docs/PACT2011/06-scalability.tex

 r1033
  \section{Scalability}
  \subsection{Performance}
- Figure \ref{Scalability} (a) shows the performance of Parabix2 on three different cores: Core2, Core i3 and Sandybridge.
+ Figure \ref{Scalability} (a) shows the performance of Parabix2 on three different cores: \CO{}, \CI\ and \SB{}.
  The average processing time of the five workloads, which is evaluated as CPU cycles per thousand bytes, is divided up by bitstream parsing and byte space postprocessing. Bitstream parsing, mainly consists of SIMD instructions,
- is able to achieve 17\% performance improvement moving from Core2 to Core i3; 22\% performance improvement moving from Core i3 to Sandybridge,
+ is able to achieve 17\% performance improvement moving from \CO\ to \CI{}; 22\% performance improvement moving from \CI\ to \SB{},
  which is relatively stable compared to postprocessing,
- which gains 18\% to 31\% performance moving from Core2 to Core i3; 0 to 17\% performance improvement moving from Core i3 to Sandybridge.
+ which gains 18\% to 31\% performance moving from \CO\ to \CI{}; 0 to 17\% performance improvement moving from \CI\ to \SB{}.
  As comparison, we also measured the performance of Expat on all the three cores, which is shown is Figure \ref{Scalability} (b).
- The performance improvement is less than 5\% by running Expat on Core i3 instead of Core2 and it is less than 10\% by running on Sandybridge instead of Core i3.
+ The performance improvement is less than 5\% by running Expat on \CI\ instead of \CO\ and it is less than 10\% by running on \SB\ instead of \CI{}.
  Parabix2 scales much better than Expat and is able to achieve an overall performance improvement up to 26\% simply by running the same code on a newer core.
- Further improvement on Sandybridge with AVX will be discussed in the next section.
+ Further improvement on \SB\ with AVX will be discussed in the next section.
  \begin{figure}
  The newer processors are not only designed to have better performance but also more energy-efficient.
- Figure \ref{power_Parabix2} shows the average power when running Parabix2 on Core2, Core i3 and Sandybridge with different input files. On Core2, the average power is about 32 watts. Core i3 saves 30\% of the power compared with Core2. Sandybridge saves 25\% of the power compared with Core i3 and consumes only 15 watts.
+ Figure \ref{power_Parabix2} shows the average power when running Parabix2 on \CO{}, \CI\ and \SB\ with different input files. On \CO{}, the average power is about 32 watts. \CI\ saves 30\% of the power compared with \CO{}. \SB\ saves 25\% of the power compared with \CI\ and consumes only 15 watts.
  The energy consumption is further improved by better performance, which means a shorter processing time, as we moved to the newer cores.
- As a result, Parabix2 on Sandybridge cost 72\% to 75\% less energy than Parabix2 on Core2.
+ As a result, Parabix2 on \SB\ cost 72\% to 75\% less energy than Parabix2 on \CO{}.
  \begin{figure}
• ## docs/PACT2011/07-avx.tex

 r1038
  advantage of the new 256-bit AVX (Advanced Vector Extensions) technology that has just become commercially available in the
- latest Intel processors based on the Sandy Bridge microarchitecture.
+ latest Intel processors based on the \SB\ microarchitecture.
  \begin{figure*}
  With the introduction of 256-bit SIMD registers with AVX technology, one might ideally expect up to a 50\% reduction in the instruction
- count for the SIMD workload of Parabix2.   However, in the Sandy Bridge
+ count for the SIMD workload of Parabix2.   However, in the \SB\
  implementation, Intel has focused on implementing floating point operations as opposed to the integer based operations.  That is,
  does not improve significantly and actually degrades for files with higher markup density (average 10\%). Dewiki.xml, on which bitwise-SIMD instructions reduced by 39\%,  saw a performance improvement of 8\%.
- We believe that this is primarily due to the intricacies of the first generation AVX implemention in Sandy Bridge,
+ We believe that this is primarily due to the intricacies of the first generation AVX implemention in \SB{},
  with significant latency in many of the 256-bit instructions in comparison to their 128-bit counterparts. The 256-bit instructions also have different scheduling constraints that seem to reduce overall SIMD throughput.   If these latency issues can be addressed
• ## docs/PACT2011/main.tex

 r1036
  \usepackage{amssymb}    % for \varnothing (empty set) symbol
  \def\lb{\linebreak[1]}
+ \def\CI{Core-i3}
+ \def\SB{SandyBridge}
+ \def\CO{Core2}
  \DeclareRobustCommand{\=}{\_\linebreak[1]}
  \pagenumbering{arabic}
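The edited files call the new macros as `\CI\ ` or `\CI{}` rather than bare `\CI`. The likely reason is that TeX discards spaces after a control word, so a bare macro followed by a space would run into the next word. The following is a minimal sketch of that behavior, not part of the changeset: the three `\def` lines are copied from `main.tex`, while the `article` class and the sample sentences are assumptions chosen for illustration.

```latex
\documentclass{article}
% Macro definitions as they appear in main.tex in this changeset.
\def\CI{Core-i3}
\def\SB{SandyBridge}
\def\CO{Core2}
\begin{document}
% TeX skips spaces after a control word, so this typesets "Core-i3has".
\CI has a three level cache hierarchy.

% A control space (backslash followed by a space) restores the word gap.
\CI\ has a three level cache hierarchy.

% Empty braces end the macro name explicitly. Before punctuation they are
% optional, since ";" and "." cannot be part of a control word, but harmless.
moving from \CO\ to \CI{}; further gains on \SB{}.
\end{document}
```

Plain `\def` silently overwrites any existing command of the same name. `\newcommand` would instead raise an error on a clash, which may be preferable when the class or a package could already define short names such as `\SB`.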