Changeset 1048


Timestamp:
Mar 27, 2011, 6:30:11 PM
Author:
ksherdy
Message:

General edits.

Location:
docs/PACT2011
Files:
3 edited

Legend: unchanged lines are unmarked; lines prefixed with '-' were removed and lines prefixed with '+' were added.
  • docs/PACT2011/05-corei3.tex

    r1042 r1048  
  \subsection{Cache behavior}
  \CI\ has a three level cache hierarchy.  The miss penalty for each
- level is about 4 cycles, 11 cycles, and 36 cycles.  Figure
+ level is approximately 4, 11, and 36 cycles respectively.  Figure
  \ref{corei3_L1DM}, Figure \ref{corei3_L2DM} and Figure
  \ref{corei3_L3TM} show the L1, L2 and L3 data cache misses of all the
  four parsers.  Although XML parsing is not a memory intensive
  application, the cost of a cache miss for Expat and Xerces can be about
- half cycle per byte while the performance of Parabix is hardly
- affected by cache misses.  Cache miss isn't just a problem for
+ half a cycle per byte while the performance of Parabix is essentially
+ unaffected by cache misses.  Cache misses are not just a problem for
  performance but also energy consumption.  An L1 cache miss costs about
  8.3nJ; an L2 cache miss about 19nJ; an L3 cache miss about 40nJ.
- With a 1GB input file, Expat would consume more than 0.6J and Xerces
- would consume 0.9J on cache misses alone.
+ With a 1GB input file, Expat and Xerces would consume over 0.6J and 0.9J, respectively, on cache misses alone.
+ %With a 1GB input file, Expat would consume more than 0.6J and Xerces
+ %would consume 0.9J on cache misses alone.
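The per-byte arithmetic behind these figures can be sketched as follows, writing $m_{L1}$, $m_{L2}$ and $m_{L3}$ for the per-byte miss rates read off the cache-miss figures (the symbols are assumed here, not notation from the paper):

    % Sketch of the cache-miss energy estimate; m_{L1}, m_{L2}, m_{L3} are
    % assumed symbols for the per-byte miss rates taken from the figures.
    \begin{equation*}
      E_{\mathrm{miss}} \approx N_{\mathrm{bytes}}
        \left( m_{L1} \cdot 8.3\,\mathrm{nJ}
             + m_{L2} \cdot 19\,\mathrm{nJ}
             + m_{L3} \cdot 40\,\mathrm{nJ} \right)
    \end{equation*}

For instance, 0.6J spread over a 1GB (roughly $10^9$ byte) input corresponds to about 0.6nJ of cache-miss energy per byte.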
     
  \subsection{Branch Mispredictions}
  Despite years of improvement, branch misprediction is still a
- significant bottleneck of performance.  The penalty of a branch
- misprediction is generally more than 10 CPU cycles.  As shown in
- Figure \ref{corei3_BM}, the cost of branch mispredictions for Expat
- can be more than 7 cycles per byte, which is as much as the processing
- time of Parabix2 on the same workload.
+ significant performance bottleneck.  The cost of a branch
+ misprediction is generally over 10 CPU cycles.  As shown in
+ Figure \ref{corei3_BM}, the cost of branch mispredictions per byte of XML for Expat
+ can be over 7 cycles, which is approximately the number of cycles
+ required by Parabix2 to process a byte of XML data on the same workload.

- Reducing the branch misprediction rate is difficult for text-based
+ But reducing the branch misprediction rate is difficult for text-based
  applications due to the variable-length nature of syntactic elements.
- Therefore, the alternative solution of reducing branches becomes more
- attractive.  However, the traditional byte-at-a-time method of XML
- parsing usually involves large amount of inevitable branches.  As
+ Therefore, the goal is to reduce the total number of branches.  However, traditional byte-at-a-time XML
+ parsing requires a large number of unavoidable branches.  As
  shown in Figure \ref{corei3_BR}, Xerces can have an average of 13
  branches for each byte it processes on the high markup density file.
- Parabix substantially eliminate the branches by using parallel bit
- streams.  Parabix1 still have a few branches for each block of 128
- bytes (SSE) due to the sequential scanning.  But with the new parallel
- scanning technique, Parabix2 is essentially branch-free as shown in
- the Figure \ref{corei3_BR}.  As a result, Parabix2 has minimal
+ Parabix1 minimizes branches by using parallel bit streams for each 128-bit block but still requires a few
+ branches for sequential scanning.  Utilizing the new parallel scanning technique, Parabix2 is relatively branch-free, as shown in Figure \ref{corei3_BR}.  As a result, Parabix2 has minimal
  dependency on the markup density of the workloads.
+ % Parabix1 minimize the branches by using parallel bit
+ % streams.  Parabix1 still have a few branches for each block of 128
+ % bytes (SSE) due to the sequential scanning.  But with the new parallel
+ % scanning technique, Parabix2 is essentially branch-free as shown in
+ % the Figure \ref{corei3_BR}.  As a result, Parabix2 has minimal
+ % dependency on the markup density of the workloads.
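As an illustrative sketch of the branch-free approach (not the Parabix source; the function name and test string below are made up for the example), a single SSE2 comparison can mark every occurrence of a character in a 16-byte block, leaving branches only in the enumeration of the resulting bits, whose count tracks markup density rather than input length:

    #include <emmintrin.h>   // SSE2 intrinsics
    #include <cstdint>
    #include <cstdio>

    // Build a 16-bit marker bitstream for one 16-byte block: bit i is set
    // when buf[i] == '<'.  The comparison is data-parallel, so there is no
    // per-byte branch; only the scan of the resulting mask branches.
    static inline uint32_t markLeftAngles(const uint8_t *buf) {
        __m128i block = _mm_loadu_si128(reinterpret_cast<const __m128i *>(buf));
        __m128i hits  = _mm_cmpeq_epi8(block, _mm_set1_epi8('<'));
        return static_cast<uint32_t>(_mm_movemask_epi8(hits));
    }

    int main() {
        const char text[] = "<a><b>text</b>  ";   // 16 bytes of sample input
        uint32_t mask = markLeftAngles(reinterpret_cast<const uint8_t *>(text));
        while (mask != 0) {
            int pos = __builtin_ctz(mask);         // GCC/Clang builtin
            std::printf("'<' at offset %d\n", pos);
            mask &= mask - 1;                      // clear the lowest set bit
        }
        return 0;
    }

A byte-at-a-time parser instead takes a data-dependent branch on every input byte, which is the source of the per-byte branch counts shown in Figure \ref{corei3_BR}.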
  \begin{figure}
     
  \end{figure}

- \subsection{SIMD/Total Instructions}
+ \subsection{SIMD Instructions vs. Total Instructions}

  Parabix gains its performance by using parallel bitstreams, which are
  mostly generated and calculated by SIMD instructions.  The ratio of
  executed SIMD instructions over total instructions indicates the
- amount of parallel processing we were able to achieve.  We use Intel
- pin, a dynamic binary instrumentation tool, to gather instruction mix.
- Then we adds up all the vector instructions that have been executed.
- Figure \ref{corei3_INS_p1} and Figure \ref{corei3_INS_p2} show the
- percentage of SIMD instructions of Parabix1 and Parabix2 (Expat and
- Xerce do not use any SIMD instructions).  For Parabix1, 18\% to 40\%
+ amount of parallel processing we were able to achieve.
+ Using Intel Pin, a dynamic binary instrumentation tool, we gathered the dynamic instruction mix of each XML workload and classified the instructions as either vector (SIMD) or non-vector (non-SIMD) instructions.
+ Figure \ref{corei3_INS_p1} and Figure \ref{corei3_INS_p2} show the
+ percentage of SIMD instructions in Parabix1 and Parabix2.
+ %(Expat and Xerces do not use any SIMD instructions)
+ For Parabix1, 18\% to 40\%
  of the executed instructions consist of SIMD instructions.  By using
  bitstream addition for parallel scanning, Parabix2 uses 60\% to 80\%
- SIMD instructions.  Although the ratio decrease as the markup density
- increase for both Parabix1 and Parabix2, the decreasing rate of
- Parabix2 is much lower and thus the performance degradation caused by
- increasing markup density is smaller.
+ SIMD instructions.  Although the SIMD ratio decreases as the markup density increases
+ for both Parabix1 and Parabix2, it falls more slowly for
+ Parabix2, and thus the performance penalty incurred by
+ increasing the markup density is smaller.
+ %Expat and Xerces do not use any SIMD instructions and were not included in this portion of the study.

+ % Parabix gains its performance by using parallel bitstreams, which are
+ % mostly generated and calculated by SIMD instructions.  The ratio of
+ % executed SIMD instructions over total instructions indicates the
+ % amount of parallel processing we were able to achieve.  We use Intel
+ % pin, a dynamic binary instrumentation tool, to gather instruction mix.
+ % Then we adds up all the vector instructions that have been executed.
+ % Figure \ref{corei3_INS_p1} and Figure \ref{corei3_INS_p2} show the
+ % percentage of SIMD instructions of Parabix1 and Parabix2 (Expat and
+ % Xerce do not use any SIMD instructions).  For Parabix1, 18\% to 40\%
+ % of the executed instructions consists of SIMD instructions.  By using
+ % bistream addition for parallel scanning, Parabix2 uses 60\% to 80\%
+ % SIMD instructions.  Although the ratio decrease as the markup density
+ % increase for both Parabix1 and Parabix2, the decreasing rate of
+ % Parabix2 is much lower and thus the performance degradation caused by
+ % increasing markup density is smaller.

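For reference, an instruction-mix count of this kind can be gathered with a small Pin tool along the following lines. This is a minimal sketch rather than the tool used for the measurements; in particular, the set of XED categories treated as SIMD below (SSE, MMX, AVX) is an assumption and may differ from the classification used in the study.

    // Minimal Pin tool sketch: counts all executed instructions and the
    // subset whose XED category is treated here as SIMD.  The category
    // list is illustrative only.
    #include "pin.H"
    #include <iostream>

    static UINT64 totalCount = 0;
    static UINT64 simdCount  = 0;

    static VOID CountTotal() { totalCount++; }
    static VOID CountSimd()  { simdCount++;  }

    static BOOL IsSimd(INS ins) {
        INT32 cat = INS_Category(ins);
        return cat == XED_CATEGORY_SSE || cat == XED_CATEGORY_MMX
            || cat == XED_CATEGORY_AVX;
    }

    static VOID Instruction(INS ins, VOID *) {
        INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)CountTotal, IARG_END);
        if (IsSimd(ins))
            INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)CountSimd, IARG_END);
    }

    static VOID Fini(INT32, VOID *) {
        std::cerr << "SIMD / total = " << simdCount << " / " << totalCount << std::endl;
    }

    int main(int argc, char *argv[]) {
        if (PIN_Init(argc, argv)) return 1;
        INS_AddInstrumentFunction(Instruction, 0);
        PIN_AddFiniFunction(Fini, 0);
        PIN_StartProgram();   // never returns
        return 0;
    }

The reported ratio simdCount/totalCount corresponds to the SIMD-instruction percentage plotted in Figure \ref{corei3_INS_p1} and Figure \ref{corei3_INS_p2}.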
  \begin{figure}
     

  Figure \ref{corei3_TOT} shows the result of the overall performance
- evaluated as CPU cycles per thousands input bytes.  Parabix1 is 1.5 to
+ evaluated as CPU cycles per thousand input bytes.  Parabix1 is 1.5 to
  2.5 times faster on document-oriented input and 2 to 3 times faster on
  data-oriented input compared with Expat and Xerces.  Parabix2 is 2.5
     
  There is growing concern about power consumption and energy efficiency.
  Chip producers not only work on improving the performance but also
- have worked hard to develop power efficient chips.  We studied the
+ have worked hard to develop power efficient chips. We studied the
  power and energy consumption of Parabix in comparison with Expat and
  Xerces on \CI{}.
     
  The more interesting trend is energy. Figure \ref{corei3_energy} shows
  the energy consumption of the four different parsers.  Although
- Parabix2 needs slight higer power, its processing time is much shorter
- and therefore consumes much less energy. Parabix2 consumes 50 to 75
+ Parabix2 requires slightly higher power, its processing time is significantly lower
+ and it therefore consumes substantially less energy than the other parsers. Parabix2 consumes 50 to 75
  nJ per byte while Expat and Xerces consume 80nJ to 320nJ and 140nJ to
  370nJ per byte respectively.
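The underlying trade-off is simply that energy is average power multiplied by run time; in per-byte terms (the symbols below are assumed notation, not the paper's):

    % Energy per byte is roughly average power times processing time per byte.
    \begin{equation*}
      E_{\mathrm{byte}} \approx P_{\mathrm{avg}} \times t_{\mathrm{byte}}
    \end{equation*}

So, as a hypothetical illustration, a parser drawing 10% more power but finishing in half the time still uses roughly 45% less energy per byte.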
  • docs/PACT2011/06-scalability.tex

    r1039 r1048  
  The average processing time of the five workloads, which is evaluated as CPU cycles per thousand bytes,
  is divided between bitstream parsing and byte space postprocessing.
- Bitstream parsing, mainly consists of SIMD instructions,
+ Bitstream parsing, which mainly consists of SIMD instructions,
  is able to achieve 17\% performance improvement moving from \CO\ to \CI{};
  22\% performance improvement moving from \CI\ to \SB{},
  • docs/PACT2011/07-avx.tex

    r1039 r1048  

  Parabix2 was originally developed for 128-bit SSE2 technology widely
- available on all 64-bit Intel and AMD processors.  In this section,
+ and is available on all 64-bit Intel and AMD processors.  In this section,
  we discuss the scalability and performance of Parabix2 to take
  advantage of the new 256-bit AVX (Advanced Vector Extensions)