Changeset 1366


Ignore:
Timestamp:
Aug 24, 2011, 3:28:57 PM (8 years ago)
Author:
ashriram
Message:

Finished Section 5

File:
1 edited

Legend:

Unmodified
Added
Removed
  • docs/HPCA2012/05-corei3.tex

    r1365 r1366  
    11\section{Efficiency of the Parabix}
    22\label{section:baseline}
    3 
    4 %[AS] Need to say what we are doing here.
    5 
    6 brief, for each of the four XML parsers under study we propose to measure
    7 and evaluate the energy consumption required to carry out XML
    8 well-formedness checking, under a variety of workloads, and as
    9 executed on three different Intel processors.
    10 
    11 
     3In this section we analyze the energy and performance characteristics
     4of the Parabix-based XML parser against the software XML parsers,
     5Xerces and Expat. For our baseline evaluation, we compare all the XML
     6parsers on a fixed platform, the \CITHREE{}.
    127
    138
    149%some of the numbers are roughly calculated, needs to be recalculated for final version
    15 \subsection{Cache behavior}
    16 \CITHREE\  cache hierarchy.  The approximate miss
    17 penalty for each cache level is 4, 11, and 36 cycles respectively.
    18 Figure \ref{corei3_L1DM}, Figure \ref{corei3_L2DM} and Figure
    19 \ref{corei3_L3TM} show the L1, L2 and L3 data cache misses for each of
    20 the parsers.  Although XML parsing is non memory intensive
    21 application, cache misses for the Expat and Xerces parsers represent a
    22 0.5 cycle per XML byte cost whereas the performance of the Parabix
    23 parsers remains essentially unaffected by data cache misses.  Cache
    24 misses not only consume additional CPU cycles but increase application
    25 energy consumption.  L1, L2, and L3 cache misses consume approximately
    26 8.3nJ, 19nJ, and 40nJ respectively. As such, given a 1GB XML file as
    27 input, Expat and Xerces would consume over 0.6J and 0.9J respectively
    28 due to cache misses alone.
     10\subsection{Cache behaviour}
     11The approximate miss penalty in \CITHREE\ for L1, L2 and L3 caches is
     124, 11, and 36 cycles respectively.  The L1 (32KB) and L2 cache (256KB)
     13are private per core, while the 4MB L3 is shared by all the
     14cores. Figure \ref{cache_misses} shows the cache misses per kilobyte
     15of input data. Analytically, the cache misses for the Expat and Xerces
     16parsers represent a 0.5 cycle per XML byte processed. This overhead
     17does not necessarily reflect in the overall performance of these
     18parsers as they experience other overheads related to branch
     19mispredictions. Parabix's data reorganization significantly improves
     20the overall cache miss rate. We experience ?$\times$ less misses at
     21the L1 and ?$\times$ less misses at the L2 level. The improved cache
     22utilization keeps the SIMD units busy and prevent memory related
     23stalls. Note that cache misses also cause increased application energy
     24consumption due to increased energy required to access higher levels
     25in the cache hierarchy. We estimated with microbenchmarks that the L1,
     26L2, and L3 cache misses consume approximately 8.3nJ, 19nJ, and 40nJ
     27respectively. For a 1GB XML file Expat and Xerces would consume over
     280.6J and 0.9J respectively due to cache misses alone.
    2929%With a 1GB input file, Expat would consume more than 0.6J and Xercesn
    3030%would consume 0.9J on cache misses alone.
     
    4545}
    4646\caption{Cache Misses per kB of input data.}
     47\label{cache_misses}
    4748\end{figure}
    4849
    4950\subsection{Branch Mispredictions}
    50 Despite improvements in branch prediction, branch misprediction
    51 penalties contribute significantly to XML parsing performance. On
    52 modern commodity processors the cost of a single branch misprediction
    53 is commonly cited as over 10 CPU cycles.  As shown in Figure
    54 \ref{corei3_BM}, the cost of branch mispredictions for the Expat
    55 parser can be over 7 cycles per XML byte---this cost alone is equal to
    56 the average total cost for Parabix2 to process each byte of XML.
     51In general, reducing the branch misprediction rate is difficult in
     52text-based XML parsing applications. This is due to (1) variable
     53length nature of the syntactic elements contained within XML
     54documents, (2) a data dependent characteristic, and (3) the extensive
     55set of syntax constraints imposed by the XML. Traditional
     56byte-at-a-time XML parser's performance is limited by the number of
     57branch mispredictions.  As shown in Figure \ref{corei3_BR}, Xerces
     58averages up to 13 branches per XML byte processed on high density
     59markup. On modern commodity processors the cost of a single branch
     60misprediction is incur over 10s of CPU cycles to restart the processor
     61pipeline. The high miss prediction rate in conventional parsers add
     62significant overhead. In Parabix the transformation to SIMD operation
     63eliminates many branches. Further optimizations take advantage of
     64Parabix's data organization and replace condition branches with {\em
     65  bit scan} operations that can process up to 64 characters worth of
     66branches with one operation. In many cases, we also replace the
     67branches with logical predicate operations. Our predicate are cheaper
     68to compute since they involve only bit parallel SIMD operations.
    5769
    58 In general, reducing the branch misprediction rate is difficult in
    59 text-based XML parsing applications. This is due in part to the
    60 variable length nature of the syntactic elements contained within XML
    61 documents, a data dependent characterstic, as well as the extensive
    62 set of syntax constraints imposed by the XML 1.0 specification. As
    63 such, traditional byte-at-a-time XML parsers generate a performance
    64 limiting number of branch mispredictions.  As shown in Figure
    65 \ref{corei3_BR}, Xerces averages up to 13 branches per XML byte
    66 processed on high density markup.
     70 As shown in Figure \ref{corei3_BR},
     71Parabix processing is almost branch free. Parabix exhibits minimal
     72dependence on source XML markup density; it experiences a constant
     73number of branch mispredictions irrespective of the input. The cost of
     74branch mispredictions for the Expat parser can be over 7 cycles per
     75XML byte (see Figure \ref{corei3_BM}) ---this cost alone is higher
     76than the average latency of a byte processed by Parabix.
    6777
    68 The performance improvement of Parabix1 in terms of branch
    69 mispredictions results from the veritable elimination of conditional
    70 branch instructions in scanning. Leveraging the processor built-in
    71 {\em bit scan} operation together with parallel bit stream technology
    72 Parabix1 can scan up to 64 bytes of source XML with a single {\em bit
    73   scan} instruction. In comparison, a byte-at-a-time parser must
    74 process a conditional branch instruction per XML byte scanned.
    7578
    76 As shown in Figure \ref{corei3_BR}, Parabix2 processing is almost
    77 branch free. Utilizing a new parallel scanning technique based on bit
    78 stream addition, Parabix2 exhibits minimal dependence on source XML
    79 markup density. Figure \ref{corei3_BR} displays this lack of data
    80 dependence via the constant number of branch mispredictions shown for
    81 each of the source XML files.
    82 % Parabix1 minimize the branches by using parallel bit
    83 % streams.  Parabix1 still have a few branches for each block of 128
    84 % bytes (SSE) due to the sequential scanning.  But with the new parallel
    85 % scanning technique, Parabix2 is essentially branch-free as shown in
    86 % the Figure \ref{corei3_BR}.  As a result, Parabix2 has minimal
    87 % dependency on the markup density of the workloads.
    8879
    8980
     
    10394\subsection{SIMD Instructions vs. Total Instructions}
    10495
    105 Parabix achieves performance via parallel bit stream technology. In
    106 Parabix XML processing, parallel bit streams are both computed and
     96In Parabix, bit streams are both computed and
    10797predominately operated upon using the SIMD instructions of commodity
    10898processors.  The ratio of retired SIMD instructions to total
    109 instructions provides insight into\ the relative degree to which
     99instructions provides insight into the relative degree to which
    110100Parabix achieves parallelism over the byte-at-a-time approach.
     101
    111102
    112103Using the Intel Pin tool, we gather the dynamic instruction mix for
    113104each XML workload, and classify instructions as either vector (SIMD)
    114 or non-vector instructions.  Figures \ref{corei3_INS_p1} and
    115 \ref{corei3_INS_p2} show the percentage of SIMD instructions for
    116 Parabix1 and Parabix2 respectively.
     105or non-vector instructions.  Figures~\ref{corei3_INS_p2} show the
     106percentage of SIMD instructions for Parabix. The ratio of executed
     107SIMD instructions over total instructions indicates the amount of
     108parallel processing we were able to extract.
    117109%(Expat and Xerce do not use any SIMD instructions)
    118 For Parabix1, 18\% to 40\% of the executed instructions are SIMD instructions.  Using
    119 bit stream addition to scan XML characters in parallel, the Parabix2 instruction mix is made up of 60\% to 80\%
    120 SIMD instructions.  Although the resulting ratios are (negatively) proportional to the markup density
    121 for both Parabix1 and Parabix2, the degradation rate of
    122 Parabix2 is much lower and thus the performance penalty incurred by
    123 increasing the markup density is reduced.
     110The Parabix instruction mix is made up of 60\% to 80\% SIMD
     111instructions.  The markup density of the files influence the number of
     112scalar instructions needed to handle the tag processing which affects
     113the overall parallelism that can be extracted by Parabix.  We find
     114that degradation rate is low and thus the performance
     115penalty incurred by increasing the markup density is minimal.
    124116%Expat and Xerce do not use any SIMD instructions and were not
    125117%included in this portion of the study.
    126118
    127119% Parabix gains its performance by using parallel bitstreams, which
    128 % are mostly generated and calculated by SIMD instructions.  The ratio
    129 % of executed SIMD instructions over total instructions indicates the
    130 % amount of parallel processing we were able to achieve.  We use Intel
     120% are mostly generated and calculated by SIMD instructions.  We use Intel
    131121% pin, a dynamic binary instrumentation tool, to gather instruction
    132122% mix.  Then we adds up all the vector instructions that have been
    133123% executed.  Figure \ref{corei3_INS_p1} and Figure \ref{corei3_INS_p2}
    134 % show the percentage of SIMD instructions of Parabix1 and Parabix2
     124% show the percentage of SIMD instructions of Parabix1 and Parabix
    135125% (Expat and Xerce do not use any SIMD instructions).  For Parabix1,
    136126% 18\% to 40\% of the executed instructions consists of SIMD
     
    145135
    146136Figure \ref{corei3_TOT} shows overall parser performance evaluated in
    147 terms of CPU cycles per kilobyte.  Parabix1 is 1.5 to 2.5 times faster
    148 on document-oriented input and 2 to 3 times faster on data-oriented
    149 input than the Expat and Xerces parsers respectively.  Parabix2 is 2.5
    150 to 4 times faster on document-oriented input and 4.5 to 7 times faster
     137terms of CPU cycles per kilobyte.  Parabix is 2.5$\times$
     138to 4$\times$ faster on document-oriented input and 4.5 to 7 times faster
    151139on data-oriented input.  Traditional parsers can be dramatically
    152140slowed by dense markup, while Parabix2 is generally unaffected.  The
     
    155143Xerces, this transcoding requires several cycles per byte.  However,
    156144transcoding using parallel bit streams is significantly faster and
    157 requires less than a single cycle per byte.  \cite{Cameron2008}.
     145requires less than a single cycle per byte.
    158146
    159147
     
    172160
    173161\subsection{Power and Energy}
    174 In response to the growing industry concerns on power consumption and
    175 energy efficiency, chip producers work hard to not only improve
    176 performance but also achieve high energy efficiency in processors
    177 design. We study the power and energy consumption of Parabix in
     162In this section, we study the power and energy consumption of Parabix in
    178163comparison with Expat and Xerces on \CITHREE{}. The average power of
    179 \CITHREE\ 530 is about 21 watts.  This Intel model has a good
    180 reputation for power efficiency. Figure \ref{corei3_power} shows the
    181 average power consumed by each parser.  Parabix2, dominated by SIMD
    182 instructions, uses approximately 5\% additional power.
     164\CITHREE\ is about 21 watts. Figure \ref{corei3_power} shows the
     165average power consumed by each parser.  Parabix, dominated by SIMD
     166instructions which uses approximately 5\% additional power. While the
     167SIMD functional units are significantly wider than the scalar
     168counterparts; register width and functional unit power account only
     169for a small fraction of the overall power consumption in a processor
     170pipeline. More importantly by using data parallel operations Parabix
     171amortizes the fetch and data access overheads. This results in minimal
     172power increase compared to the conventional parsers. 
     173Perhaps the energy trends shown in Figure
     174\ref{corei3_energy} reveal an interesting trend. Parabix consumes
     175substantially less energy than the other parsers. Parabix consumes 50
     176to 75 nJ per byte while Expat and Xerces consume 80nJ to 320nJ and
     177140nJ to 370nJ per byte respectively.  Although Parabix
     178requires slightly more power (per instruction), the processing time of
     179Parabix is significantly lower.
     180
     181
    183182
    184183
     
    197196\end{figure}
    198197
    199 As shown in Figure \ref{corei3_energy}, a comparison of energy
    200 efficiency demonstrates a more interesting result. Although Parabix2
    201 requires slightly more power (per instruction), the processing time of
    202 Parabix2 is significantly lower, and therefore Parabix2 consumes
    203 substantially less energy than the other parsers. Parabix2 consumes 50
    204 to 75 nJ per byte while Expat and Xerces consume 80nJ to 320nJ and
    205 140nJ to 370nJ per byte respectively.
    206198
Note: See TracChangeset for help on using the changeset viewer.