Changeset 1039


Timestamp:
Mar 25, 2011, 8:27:43 PM
Author:
lindanl
Message:

macros for core2 corei3 sandybridge

Location:
docs/PACT2011
Files:
7 edited

  • docs/PACT2011/00-abstract.tex

    r1037 r1039  
    2121against two widely-used XML parsers, James Clark's Expat and Apache's Xerces-C
    2222on three generations of x86 machines, including the new Intel
    23 Sandybridge.    We show that Parabix2's speedup is 2$\times$--7$\times$
     23\SB{}.    We show that Parabix2's speedup is 2$\times$--7$\times$
    2424over Expat and Xerces.  In stark contrast to the energy expenditures necessary
    2525to realize performance gains through multicore parallelism, we also show
  • docs/PACT2011/01-intro.tex

    r1025 r1039  
    7979for the performance and energy study tackled in the
    8080remainder of the paper.   Section 5 presents a
    81 detailed performance evaluation on a Core i3 processor
     81detailed performance evaluation on a \CI\ processor
    8282as our primary evaluation platform, addressing a
    8383number of microarchitectural issues including cache
     
    8686performance gains through three generations of Intel
    8787architecture culminating with performance assessment
    88 on our two week-old Sandy Bridge test machine.
     88on our two week-old \SB\ test machine.
    8989Section 7 looks specifically at issues in applying
    9090the new 256-bit AVX technology to parallel bit stream
  • docs/PACT2011/04-methodology.tex

    r1034 r1039  
    9494
    9595\subsection{Platform Hardware}
    96 \paragraph{Intel Core 2}
    97 The Intel Core 2 is a Conroe based processor produced by
     96\paragraph{Intel \CO{}}
     97The Intel \CO\ is a Conroe based processor produced by
    9898Intel. Table \ref{core2info} gives the hardware description of the
    99 Intel Core 2 machine selected.
     99Intel \CO\ machine selected.
    100100\begin{table}[h]
    101101\begin{center}
    102102\begin{tabular}{|c||c|}
    103103\hline
    104 Processor & Intel Core 2 Duo processor 6400  (2.13GHz) \\ \hline
     104Processor & Intel Core2 Duo processor 6400  (2.13GHz) \\ \hline
    105105L1 Cache & 32KB I-Cache, 32KB D-Cache \\ \hline
    106106L2 Cache & 2MB \\ \hline
     
    115115\end{table}
    116116
    117 \paragraph {Intel Core i3}
    118 The Intel Core i3 is a Nehalem based processor produced by Intel. The
     117\paragraph {Intel \CI{}}
     118The Intel \CI\ is a Nehalem based processor produced by Intel. The
    119119intent of this processor is to serve as an example low end server
    120120processor. Table \ref{i3info} gives the hardware description of the
    121 Intel Core i3 machine selected.
     121Intel \CI\ machine selected.
    122122
    123123\begin{table}[h]
     
    125125\begin{tabular}{|c||c|}
    126126\hline
    127 Processor & Intel Clarkdale I3-530 (2.93GHz) \\ \hline
     127Processor & Intel i3-530 (2.93GHz) \\ \hline
    128128L1 Cache & 32KB I-Cache, 32K D-Cache \\ \hline 
    129129L2 Cache & 256KB \\ \hline
     
    136136\end{tabular}
    137137\end{center}
    138 \caption{Core i3}
     138\caption{\CI{}}
    139139\label{i3info}
    140140\end{table}
    141141
    142142\paragraph{Intel Core i5}
    143 The Intel Core i5 is a Sandy Bridge based processor produced by
     143The Intel Core i5 is a \SB\ based processor produced by
    144144Intel. Table \ref{sandybridgeinfo} gives the hardware description of the
    145 Intel Core i3 machine selected.
     145Intel \CI\ machine selected.
    146146
    147147\begin{table}[h]
     
    149149\begin{tabular}{|c||c|}
    150150\hline
    151 Processor & Intel Core I5-2300 (2.80GHz) \\ \hline
     151Processor & Intel Sandybridge i5-2300 (2.80GHz) \\ \hline
    152152L1 Cache &  192 KB\\ \hline     
    153153L2 Cache &  4 X 256KB \\ \hline
     
    160160\end{tabular}
    161161\end{center}
    162 \caption{Sandy Bridge}
     162\caption{\SB{}}
    163163\label{sandybridgeinfo}
    164164\end{table}
  • docs/PACT2011/05-corei3.tex

    r1004 r1039  
    33%some of the numbers are roughly calculated, needs to be recalculated for final version
    44\subsection{Cache behavior}
    5 Core i3 has a three level cache hierarchy.  The miss penalty for each
     5\CI\ has a three level cache hierarchy.  The miss penalty for each
    66level is about 4 cycles, 11 cycles, and 36 cycles.  Figure
    77\ref{corei3_L1DM}, Figure \ref{corei3_L2DM} and Figure
     
    2121\includegraphics[width=0.5\textwidth]{plots/corei3_L1DM.pdf}
    2222\end{center}
    23 \caption{L1 Data Cache Misses on Core i3 (y-axis: Cache Misses per KByte)}
     23\caption{L1 Data Cache Misses on \CI\ (y-axis: Cache Misses per KByte)}
    2424\label{corei3_L1DM}
    2525\end{figure}
     
    2929\includegraphics[width=0.5\textwidth]{plots/corei3_L2DM.pdf}
    3030\end{center}
    31 \caption{L2 Data Cache Misses on Core i3 (y-axis: Cache Misses per KByte)}
     31\caption{L2 Data Cache Misses on \CI\ (y-axis: Cache Misses per KByte)}
    3232\label{corei3_L2DM}
    3333\end{figure}
     
    3737\includegraphics[width=0.5\textwidth]{plots/corei3_L3CM.pdf}
    3838\end{center}
    39 \caption{L3 Cache Misses on Core i3 (y-axis: Cache Misses per KByte)}
     39\caption{L3 Cache Misses on \CI\ (y-axis: Cache Misses per KByte)}
    4040\label{corei3_L3TM}
    4141\end{figure}
     
    6767\includegraphics[width=0.5\textwidth]{plots/corei3_BR.pdf}
    6868\end{center}
    69 \caption{Branches on Core i3 (y-axis: Branches per KByte)}
     69\caption{Branches on \CI\ (y-axis: Branches per KByte)}
    7070\label{corei3_BR}
    7171\end{figure}
     
    7575\includegraphics[width=0.5\textwidth]{plots/corei3_BM.pdf}
    7676\end{center}
    77 \caption{Branch Mispredictions on Core i3 (y-axis: Branch Mispredictions per KByte)}
     77\caption{Branch Mispredictions on \CI\ (y-axis: Branch Mispredictions per KByte)}
    7878\label{corei3_BM}
    7979\end{figure}
     
    133133\includegraphics[width=0.5\textwidth]{plots/corei3_TOT.pdf}
    134134\end{center}
    135 \caption{Processing Time on Core i3 (y-axis: Total CPU Cycles per KByte)}
     135\caption{Processing Time on \CI\ (y-axis: Total CPU Cycles per KByte)}
    136136\label{corei3_TOT}
    137137\end{figure}
     
    154154\includegraphics[width=0.5\textwidth]{plots/corei3_power.pdf}
    155155\end{center}
    156 \caption{Average Power on Core i3 (watts)}
     156\caption{Average Power on \CI\ (watts)}
    157157\label{corei3_power}
    158158\end{figure}
     
    169169\includegraphics[width=0.5\textwidth]{plots/corei3_energy.pdf}
    170170\end{center}
    171 \caption{Energy Consumption on Core i3 ($\mu$J per KByte)}
     171\caption{Energy Consumption on \CI\ ($\mu$J per KByte)}
    172172\label{corei3_energy}
    173173\end{figure}
  • docs/PACT2011/06-scalability.tex

    r1033 r1039  
    11\section{Scalability}
    22\subsection{Performance}
    3 Figure \ref{Scalability} (a) shows the performance of Parabix2 on three different cores: Core2, Core i3 and Sandybridge.
     3Figure \ref{Scalability} (a) shows the performance of Parabix2 on three different cores: \CO{}, \CI\ and \SB{}.
    44The average processing time of the five workloads, which is evaluated as CPU cycles per thousand bytes,
    55is divided up by bitstream parsing and byte space postprocessing.
    66Bitstream parsing, mainly consists of SIMD instructions,
    7 is able to achieve 17\% performance improvement moving from Core2 to Core i3;
    8 22\% performance improvement moving from Core i3 to Sandybridge,
     7is able to achieve 17\% performance improvement moving from \CO\ to \CI{};
     822\% performance improvement moving from \CI\ to \SB{},
    99which is relatively stable compared to postprocessing,
    10 which gains 18\% to 31\% performance moving from Core2 to Core i3;
    11 0 to 17\% performance improvement moving from Core i3 to Sandybridge.
     10which gains 18\% to 31\% performance moving from \CO\ to \CI{};
     110 to 17\% performance improvement moving from \CI\ to \SB{}.
    1212
    1313As comparison, we also measured the performance of Expat on all the three cores, which is shown is Figure \ref{Scalability} (b).
    14 The performance improvement is less than 5\% by running Expat on Core i3 instead of Core2
    15 and it is less than 10\% by running on Sandybridge instead of Core i3.
     14The performance improvement is less than 5\% by running Expat on \CI\ instead of \CO\
     15and it is less than 10\% by running on \SB\ instead of \CI{}.
    1616
    1717Parabix2 scales much better than Expat and is able to achieve an overall performance improvement
    1818up to 26\% simply by running the same code on a newer core.
    19 Further improvement on Sandybridge with AVX will be discussed in the next section.
     19Further improvement on \SB\ with AVX will be discussed in the next section.
    2020
    2121\begin{figure}
     
    3535
    3636The newer processors are not only designed to have better performance but also more energy-efficient.
    37 Figure \ref{power_Parabix2} shows the average power when running Parabix2 on Core2, Core i3 and Sandybridge with different input files.
    38 On Core2, the average power is about 32 watts. Core i3 saves 30\% of the power compared with Core2.
    39 Sandybridge saves 25\% of the power compared with Core i3 and consumes only 15 watts.
     37Figure \ref{power_Parabix2} shows the average power when running Parabix2 on \CO{}, \CI\ and \SB\ with different input files.
     38On \CO{}, the average power is about 32 watts. \CI\ saves 30\% of the power compared with \CO{}.
     39\SB\ saves 25\% of the power compared with \CI\ and consumes only 15 watts.
    4040
    4141The energy consumption is further improved by better performance, which means a shorter processing time, as we moved to the newer cores.
    42 As a result, Parabix2 on Sandybridge cost 72\% to 75\% less energy than Parabix2 on Core2.
     42As a result, Parabix2 on \SB\ cost 72\% to 75\% less energy than Parabix2 on \CO{}.
    4343
    4444\begin{figure}
  • docs/PACT2011/07-avx.tex

    r1038 r1039  
    66advantage of the new 256-bit AVX (Advanced Vector Extensions)
    77technology that has just become commercially available in the
    8 latest Intel processors based on the Sandy Bridge microarchitecture.
     8latest Intel processors based on the \SB\ microarchitecture.
    99
    1010\begin{figure*}
     
    4646With the introduction of 256-bit SIMD registers with AVX technology,
    4747one might ideally expect up to a 50\% reduction in the instruction
    48 count for the SIMD workload of Parabix2.   However, in the Sandy Bridge
     48count for the SIMD workload of Parabix2.   However, in the \SB\
    4949implementation, Intel has focused on implementing floating point
    5050operations as opposed to the integer based operations.  That is,
     
    110110does not improve significantly and actually degrades for files with
    111111higher markup density (average 10\%). Dewiki.xml, on which bitwise-SIMD instructions reduced by 39\%,  saw a performance improvement of 8\%.
    112 We believe that this is primarily due to the intricacies of the first generation AVX implemention in Sandy Bridge,
     112We believe that this is primarily due to the intricacies of the first generation AVX implemention in \SB{},
    113113with significant latency in many of the 256-bit instructions in comparison to their
    114114128-bit counterparts. The 256-bit instructions also have different scheduling constraints that seem to reduce overall SIMD throughput.   If these latency issues can be addressed
  • docs/PACT2011/main.tex

    r1036 r1039  
    1212\usepackage{amssymb}    % for \varnothing (empty set) symbol
    1313\def\lb{\linebreak[1]}
     14\def\CI{Core-i3}
     15\def\SB{SandyBridge}
     16\def\CO{Core2}
    1417\DeclareRobustCommand{\=}{\_\linebreak[1]}
    1518\pagenumbering{arabic}
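
The three macros added to main.tex are defined with \def and take no arguments, so TeX discards any spaces that follow a bare \CI, \SB or \CO. That is why the edited files write \CI\ before a word and \CI{} before punctuation. A minimal sketch of one alternative follows, assuming the xspace package is available; this changeset does not load it, and the \newcommand definitions below are hypothetical replacements, not what r1039 contains.

```latex
% Hypothetical alternative to the \def lines added in main.tex (r1039).
% Assumption: the xspace package is installed; the changeset does not use it.
\documentclass{article}
\usepackage{xspace}

% \newcommand reports an error if a name is already taken,
% whereas \def silently overwrites an existing definition.
\newcommand{\CI}{Core-i3\xspace}
\newcommand{\SB}{SandyBridge\xspace}
\newcommand{\CO}{Core2\xspace}

\begin{document}
% xspace adds a space before a following word but not before punctuation,
% so neither the trailing "\ " nor "{}" is needed here.
Parabix2 was measured on \CO, \CI and \SB.
\end{document}
```

With definitions like these, a line such as `\CI\ has a three level cache hierarchy` could be written as `\CI has a three level cache hierarchy`. The explicit `\ ` and `{}` forms used in the changeset remain valid and have the advantage of needing no extra package.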