Timestamp:
Aug 21, 2011, 4:20:30 PM
Author:
ashriram
Message:

Working on evaluation. Fixed Figure sizes

File:
1 edited

  • docs/HPCA2012/05-corei3.tex

    r1302 r1335  
    33%some of the numbers are roughly calculated, needs to be recalculated for final version
    44\subsection{Cache behavior}
    5 \CITHREE\ has a three level cache hierarchy.  The approximate miss penalty for each cache
    6 level is 4, 11, and 36 cycles respectively.  Figure
    7 \ref{corei3_L1DM}, Figure \ref{corei3_L2DM} and Figure
    8 \ref{corei3_L3TM} show the L1, L2 and L3 data cache misses for each of the parsers.  Although XML parsing is non memory intensive
    9 application, cache misses for the Expat and Xerces parsers represent a 0.5 cycle per XML byte cost whereas the performance of the Parabix parsers remains essentially
    10 unaffected by data cache misses.  Cache misses not only consume additional CPU cycles but increase application energy consumption.  L1, L2, and L3 cache misses consume
    11 approximately 8.3nJ, 19nJ, and 40nJ respectively. As such, given a 1GB XML file as input, Expat and Xerces would consume over 0.6J and 0.9J respectively due to cache misses alone.
     5\CITHREE\ has a three level cache hierarchy.  The approximate miss
     6penalty for each cache level is 4, 11, and 36 cycles respectively.
     7Figure \ref{corei3_L1DM}, Figure \ref{corei3_L2DM} and Figure
     8\ref{corei3_L3TM} show the L1, L2 and L3 data cache misses for each of
     9the parsers.  Although XML parsing is not a memory-intensive
     10application, cache misses for the Expat and Xerces parsers represent a
     110.5 cycle per XML byte cost whereas the performance of the Parabix
     12parsers remains essentially unaffected by data cache misses.  Cache
     13misses not only consume additional CPU cycles but increase application
     14energy consumption.  L1, L2, and L3 cache misses consume approximately
     158.3nJ, 19nJ, and 40nJ respectively. As such, given a 1GB XML file as
     16input, Expat and Xerces would consume over 0.6J and 0.9J respectively
     17due to cache misses alone.
    1218%With a 1GB input file, Expat would consume more than 0.6J and Xercesn
    1319%would consume 0.9J on cache misses alone.
     
    1521
    1622\begin{figure}
    17 \begin{center}
    18 \includegraphics[width=0.5\textwidth]{plots/corei3_L1DM.pdf}
    19 \end{center}
    20 \caption{\CITHREE\ --- L1 Data Cache Misses (y-axis: Cache Misses per kB)}
     23\subfigure[L1 Misses]{
     24\includegraphics[width=0.32\textwidth]{plots/corei3_L1DM.pdf}
    2125\label{corei3_L1DM}
    22 \end{figure}
    23 
    24 \begin{figure}
    25 \begin{center}
    26 \includegraphics[width=0.5\textwidth]{plots/corei3_L2DM.pdf}
    27 \end{center}
    28 \caption{\CITHREE\ --- L2 Data Cache Misses (y-axis: Cache Misses per kB)}
     26}
     27\subfigure[L2 Misses]{
     28\includegraphics[width=0.32\textwidth]{plots/corei3_L2DM.pdf}
    2929\label{corei3_L2DM}
    30 \end{figure}
    31 
    32 \begin{figure}
    33 \begin{center}
    34 \includegraphics[width=0.5\textwidth]{plots/corei3_L3CM.pdf}
    35 \end{center}
    36 \caption{\CITHREE\ --- L3 Cache Misses (y-axis: Cache Misses per kB)}
    37 \label{corei3_L3TM}
     30}
     31\subfigure[L3 Misses]{
     32\includegraphics[width=0.32\textwidth]{plots/corei3_L3CM.pdf}
     33\label{corei3_L3DM}
     34}
     35\caption{Cache Misses per kB of input data.}
    3836\end{figure}
    3937
    4038\subsection{Branch Mispredictions}
    41 Despite improvements in branch prediction, branch misprediction penalties contribute
    42 significantly to XML parsing performance. On modern commodity processors the cost of a single branch
    43 misprediction is commonly cited as over 10 CPU cycles.  As shown in
    44 Figure \ref{corei3_BM}, the cost of branch mispredictions for the Expat parser
    45 can be over 7 cycles per XML byte---this cost alone is equal to the average total cost for Parabix2 to process each byte of XML.
     39Despite improvements in branch prediction, branch misprediction
     40penalties contribute significantly to XML parsing performance. On
     41modern commodity processors the cost of a single branch misprediction
     42is commonly cited as over 10 CPU cycles.  As shown in Figure
     43\ref{corei3_BM}, the cost of branch mispredictions for the Expat
     44parser can be over 7 cycles per XML byte---this cost alone is equal to
     45the average total cost for Parabix2 to process each byte of XML.
    4646
    47 In general, reducing the branch misprediction rate is difficult in text-based XML parsing
    48 applications. This is due in part to the variable length nature of the syntactic elements contained within XML documents, a data dependent characterstic,
    49 as well as the extensive set of syntax constraints imposed by the XML 1.0 specification. As such, traditional byte-at-a-time XML parsers generate a performance limiting
    50 number of branch mispredictions.  As shown in Figure \ref{corei3_BR}, Xerces averages up to 13
    51 branches per XML byte processed on high density markup.
     47In general, reducing the branch misprediction rate is difficult in
     48text-based XML parsing applications. This is due in part to the
     49variable length nature of the syntactic elements contained within XML
     50documents, a data dependent characteristic, as well as the extensive
     51set of syntax constraints imposed by the XML 1.0 specification. As
     52such, traditional byte-at-a-time XML parsers generate a performance
     53limiting number of branch mispredictions.  As shown in Figure
     54\ref{corei3_BR}, Xerces averages up to 13 branches per XML byte
     55processed on high density markup.
    5256
    53 The performance improvement of Parabix1 in terms of branch mispredictions results from the veritable elimination of conditional branch instructions in scanning. Leveraging the processor built-in {\em bit scan}
    54 operation together with parallel bit stream technology Parabix1 can scan up to 64 bytes of source XML with a single {\em bit scan} instruction. In comparison, a byte-at-a-time parser must
     57The performance improvement of Parabix1 in terms of branch
     58mispredictions results from the veritable elimination of conditional
     59branch instructions in scanning. Leveraging the processor built-in
     60{\em bit scan} operation together with parallel bit stream technology
     61Parabix1 can scan up to 64 bytes of source XML with a single {\em bit
     62  scan} instruction. In comparison, a byte-at-a-time parser must
    5563process a conditional branch instruction per XML byte scanned.
    5664
    57 As shown in Figure \ref{corei3_BR}, Parabix2 processing is almost branch free. Utilizing a new parallel scanning technique based on bit stream addition, Parabix2 exhibits minimal dependence on source XML markup density. Figure \ref{corei3_BR} displays this lack of data dependence via the constant number of branch
    58 mispredictions shown for each of the source XML files.
     65As shown in Figure \ref{corei3_BR}, Parabix2 processing is almost
     66branch free. Utilizing a new parallel scanning technique based on bit
     67stream addition, Parabix2 exhibits minimal dependence on source XML
     68markup density. Figure \ref{corei3_BR} displays this lack of data
     69dependence via the constant number of branch mispredictions shown for
     70each of the source XML files.
    5971% Parabix1 minimize the branches by using parallel bit
    6072% streams.  Parabix1 still have a few branches for each block of 128
     
    6476% dependency on the markup density of the workloads.
    6577
    66 \begin{figure}
    67 \begin{center}
    68 \includegraphics[width=0.5\textwidth]{plots/corei3_BR.pdf}
    69 \end{center}
    70 \caption{\CITHREE\ --- Branch Instructions (y-axis: Branches per kB)}
    71 \label{corei3_BR}
    72 \end{figure}
    7378
    7479\begin{figure}
    75 \begin{center}
    76 \includegraphics[width=0.5\textwidth]{plots/corei3_BM.pdf}
    77 \end{center}
    78 \caption{\CITHREE\ --- Branch Mispredictions (y-axis: Branch Mispredictions per kB)}
     80\subfigure[Branch Instructions]{
     81\includegraphics[width=0.45\textwidth]{plots/corei3_BR.pdf}
     82\label{corei3_BR}
     83}
     84\hfill
     85\subfigure[Branch Misses]{
     86\includegraphics[width=0.42\textwidth]{plots/corei3_BM.pdf}
    7987\label{corei3_BM}
     88}
     89\caption{Branch characteristics on the \CITHREE\ per kB of input data.}
    8090\end{figure}
    8191
    8292\subsection{SIMD Instructions vs. Total Instructions}
    8393
    84 Parabix achieves performance via parallel bit stream technology. In Parabix XML processing, parallel bit streams are
    85 both computed and predominately operated upon using the SIMD instructions of commodity processors.  The ratio of
    86 retired SIMD instructions to total instructions provides insight into\ the relative degree to which Parabix achieves parallelism
    87 over the byte-at-a-time approach.
     94Parabix achieves performance via parallel bit stream technology. In
     95Parabix XML processing, parallel bit streams are both computed and
     96predominately operated upon using the SIMD instructions of commodity
     97processors.  The ratio of retired SIMD instructions to total
     98instructions provides insight into\ the relative degree to which
     99Parabix achieves parallelism over the byte-at-a-time approach.
    88100
    89 Using the Intel Pin tool, we gather the dynamic instruction mix for each XML workload, and classify instructions as either vector (SIMD) or non-vector instructions.
    90 Figures \ref{corei3_INS_p1} and \ref{corei3_INS_p2} show the
    91 percentage of SIMD instructions for Parabix1 and Parabix2 respectively.
     101Using the Intel Pin tool, we gather the dynamic instruction mix for
     102each XML workload, and classify instructions as either vector (SIMD)
     103or non-vector instructions.  Figures \ref{corei3_INS_p1} and
     104\ref{corei3_INS_p2} show the percentage of SIMD instructions for
     105Parabix1 and Parabix2 respectively.
    92106%(Expat and Xerce do not use any SIMD instructions)
    93107For Parabix1, 18\% to 40\% of the executed instructions are SIMD instructions.  Using
     
    97111Parabix2 is much lower and thus the performance penalty incurred by
    98112increasing the markup density is reduced.
    99 %Expat and Xerce do not use any SIMD instructions and were not included in this portion of the study.
     113%Expat and Xerce do not use any SIMD instructions and were not
     114%included in this portion of the study.
    100115
    101 % Parabix gains its performance by using parallel bitstreams, which are
    102 % mostly generated and calculated by SIMD instructions.  The ratio of
    103 % executed SIMD instructions over total instructions indicates the
     116% Parabix gains its performance by using parallel bitstreams, which
     117% are mostly generated and calculated by SIMD instructions.  The ratio
     118% of executed SIMD instructions over total instructions indicates the
    104119% amount of parallel processing we were able to achieve.  We use Intel
    105 % pin, a dynamic binary instrumentation tool, to gather instruction mix.
    106 % Then we adds up all the vector instructions that have been executed.
    107 % Figure \ref{corei3_INS_p1} and Figure \ref{corei3_INS_p2} show the
    108 % percentage of SIMD instructions of Parabix1 and Parabix2 (Expat and
    109 % Xerce do not use any SIMD instructions).  For Parabix1, 18\% to 40\%
    110 % of the executed instructions consists of SIMD instructions.  By using
    111 % bistream addition for parallel scanning, Parabix2 uses 60\% to 80\%
    112 % SIMD instructions.  Although the ratio decrease as the markup density
    113 % increase for both Parabix1 and Parabix2, the decreasing rate of
    114 % Parabix2 is much lower and thus the performance degradation caused by
    115 % increasing markup density is smaller.
     120% pin, a dynamic binary instrumentation tool, to gather instruction
     121% mix.  Then we adds up all the vector instructions that have been
     122% executed.  Figure \ref{corei3_INS_p1} and Figure \ref{corei3_INS_p2}
     123% show the percentage of SIMD instructions of Parabix1 and Parabix2
     124% (Expat and Xerce do not use any SIMD instructions).  For Parabix1,
     125% 18\% to 40\% of the executed instructions consists of SIMD
     126% instructions.  By using bistream addition for parallel scanning,
     127% Parabix2 uses 60\% to 80\% SIMD instructions.  Although the ratio
     128% decrease as the markup density increase for both Parabix1 and
     129% Parabix2, the decreasing rate of Parabix2 is much lower and thus the
     130% performance degradation caused by increasing markup density is
     131% smaller.
     132
     133\subsection{CPU Cycles}
     134
     135Figure \ref{corei3_TOT} shows overall parser performance evaluated in
     136terms of CPU cycles per kilobyte.  Parabix1 is 1.5 to 2.5 times faster
     137on document-oriented input and 2 to 3 times faster on data-oriented
     138input than the Expat and Xerces parsers respectively.  Parabix2 is 2.5
     139to 4 times faster on document-oriented input and 4.5 to 7 times faster
     140on data-oriented input.  Traditional parsers can be dramatically
     141slowed by dense markup, while Parabix2 is generally unaffected.  The
     142results presented are not entirely fair to the Xerces parser since it
     143first transcodes input from UTF-8 to UTF-16 before processing. In
     144Xerces, this transcoding requires several cycles per byte.  However,
     145transcoding using parallel bit streams is significantly faster and
     146requires less than a single cycle per byte \cite{Cameron2008}.
    116147
    117148
    118149\begin{figure}
    119 \begin{center}
    120 \includegraphics[width=0.5\textwidth]{plots/corei3_INS_p1.pdf}
    121 \end{center}
    122 \caption{Parabix1 --- SIMD vs. Non-SIMD Instructions (y-axis: Percent SIMD Instructions}
    123 \label{corei3_INS_p1}
     150\subfigure[Performance : \# Cycles/kb]{
     151\includegraphics[width=0.5\textwidth]{plots/corei3_TOT.pdf}
     152\label{corei3_TOT}
     153}
     154\hfill
     155\subfigure[SIMD Instruction Breakdown. Y Axis :  \% SIMD Instruction/kb]{
     156\includegraphics[width=0.5\textwidth]{plots/corei3_INS_p2.pdf}
     157\label{corei3_INS_p2}
     158}
    124159\end{figure}
    125160
     161
     162\subsection{Power and Energy}
     163In response to the growing industry concerns on power consumption and
     164energy efficiency, chip producers work hard to not only improve
     165performance but also achieve high energy efficiency in processors
     166design. We study the power and energy consumption of Parabix in
     167comparison with Expat and Xerces on \CITHREE{}. The average power of
     168\CITHREE\ 530 is about 21 watts.  This Intel model has a good
     169reputation for power efficiency. Figure \ref{corei3_power} shows the
     170average power consumed by each parser.  Parabix2, dominated by SIMD
     171instructions, uses approximately 5\% additional power.
     172
     173
     174
     175
    126176\begin{figure}
    127 \begin{center}
    128 \includegraphics[width=0.5\textwidth]{plots/corei3_INS_p2.pdf}
    129 \end{center}
    130 \caption{Parabix2 --- SIMD vs. Non-SIMD Instructions (y-axis: Percent SIMD Instructions)}
    131 \label{corei3_INS_p2}
     177\subfigure[Avg. Power (Watts)]{
     178\includegraphics[width=0.4\textwidth]{plots/corei3_power.pdf}
     179\label{corei3_power}
     180}
     181\hfill
     182\subfigure[Energy Consumption ($\mu$J per kB)]{
     183\includegraphics[width=0.4\textwidth]{plots/corei3_energy.pdf}
     184\label{corei3_energy}
     185}
    132186\end{figure}
    133187
    134 \subsection{CPU Cycles}
     188As shown in Figure \ref{corei3_energy}, a comparison of energy
     189efficiency demonstrates a more interesting result. Although Parabix2
     190requires slightly more power (per instruction), the processing time of
     191Parabix2 is significantly lower, and therefore Parabix2 consumes
     192substantially less energy than the other parsers. Parabix2 consumes 50
     193to 75 nJ per byte while Expat and Xerces consume 80nJ to 320nJ and
     194140nJ to 370nJ per byte respectively.
    135195
    136 Figure \ref{corei3_TOT} shows overall parser performance
    137 evaluated in terms of CPU cycles per kilobyte.  Parabix1 is 1.5 to
    138 2.5 times faster on document-oriented input and 2 to 3 times faster on
    139 data-oriented input than the Expat and Xerces parsers respectively.  Parabix2 is 2.5
    140 to 4 times faster on document-oriented input and 4.5 to 7 times faster
    141 on data-oriented input.  Traditional parsers can be dramatically
    142 slowed by dense markup, while Parabix2 is generally unaffected.  The results presented are not entirely fair to the
    143 Xerces parser since it first transcodes input from UTF-8 to UTF-16 before processing. In Xerces, this transcoding requires
    144 several cycles per byte.  However, transcoding using parallel
    145 bit streams is significantly faster and requires less than a single cycle per byte.
    146 \cite{Cameron2008}.
    147 
    148 \begin{figure}
    149 \begin{center}
    150 \includegraphics[width=0.5\textwidth]{plots/corei3_TOT.pdf}
    151 \end{center}
    152 \caption{\CITHREE\ --- Performance (y-axis: CPU Cycles per kB)}
    153 \label{corei3_TOT}
    154 \end{figure}
    155 
    156 \subsection{Power and Energy}
    157 In response to the growing industry concerns on power consumption and energy efficiency,
    158 chip producers work hard to not only improve performance but
    159 also achieve high energy efficiency in processors design. We study the
    160 power and energy consumption of Parabix in comparison with Expat and
    161 Xerces on \CITHREE{}. The average power of \CITHREE\ 530 is about 21 watts.
    162 This Intel model has a good reputation for power efficiency. Figure \ref{corei3_power} shows the average power consumed by each parser.
    163 Parabix2, dominated by SIMD instructions, uses approximately 5\% additional power.     
    164 
    165 \begin{figure}
    166 \begin{center}
    167 \includegraphics[width=0.5\textwidth]{plots/corei3_power.pdf}
    168 \end{center}
    169 \caption{\CITHREE\ --- Average Power Consumption (watts)}
    170 \label{corei3_power}
    171 \end{figure}
    172 
    173 As shown in Figure \ref{corei3_energy}, a comparison of energy efficiency demonstrates a more interesting result. Although
    174 Parabix2 requires slightly more power (per instruction), the processing time of Parabix2 is significantly lower,
    175 and therefore Parabix2 consumes substantially less energy than the other parsers. Parabix2 consumes 50 to 75
    176 nJ per byte while Expat and Xerces consume 80nJ to 320nJ and 140nJ to 370nJ per byte respectively.
    177 
    178 \begin{figure}
    179 \begin{center}
    180 \includegraphics[width=0.5\textwidth]{plots/corei3_energy.pdf}
    181 \end{center}
    182 \caption{\CITHREE\ --- Energy Consumption ($\mu$J per kB)}
    183 \label{corei3_energy}
    184 \end{figure}
    185 