Changeset 1400 for docs


Timestamp: Aug 30, 2011, 6:54:45 PM
Author: ksherdy
Message: edits
File: 1 edited

  • docs/HPCA2012/05-corei3.tex

r1390 → r1400
-\section{Efficiency of the Parabix}
+\section{Efficiency of the Parabix-XML Parser}
 \label{section:baseline}
 In this section we analyze the energy and performance characteristics
 of the Parabix-based XML parser against the software XML parsers,
 Xerces and Expat. For our baseline evaluation, we compare all the XML
-parsers on a fixed platform, the \CITHREE{}.
+parsers on the \CITHREE{}.


 %some of the numbers are roughly calculated, needs to be recalculated for final version
 \subsection{Cache behavior}
-The approximate miss penalty in \CITHREE\ for L1, L2 and L3 caches is
-4, 11, and 36 cycles respectively.  The L1 (32KB) and L2 cache (256KB)
-are private per core, while the 4MB L3 is shared by all the
-cores. Figure \ref{cache_misses} shows the cache misses per kilobyte
+The approximate miss penalty on the \CITHREE\ for the L1, L2, and L3 caches is
+4, 11, and 36 cycles respectively. The L1 (32KB) and L2 (256KB) caches
+are private per core; the L3 (4MB) is shared by all cores.
+Figure \ref{cache_misses} shows the cache misses per kilobyte
 of input data. Analytically, the cache misses for the Expat and Xerces
 parsers represent an overhead of roughly 0.5 cycles per XML byte processed. This overhead
 is not necessarily visible in the overall performance of these
-parsers as they experience other overheads related to branch
-mispredictions. Parabix's data reorganization significantly improves
-the overall cache miss rate. We experience 7$\times$ less misses than
-Expat and 25$\times$ less misses than Xerces at the L1 and 104$\times$ less misses than
-Expat and 15$\times$ less misses than Xerces at the L2 level. The improved cache
-utilization keeps the SIMD units busy and prevent memory related
-stalls. Note that cache misses also cause increased application energy
-consumption due to increased energy required to access higher levels
-in the cache hierarchy. We estimated with microbenchmarks that the L1,
-L2, and L3 cache misses consume approximately 8.3nJ, 19nJ, and 40nJ
-respectively. For a 1GB XML file Expat and Xerces would consume over
+parsers, as they also experience other overheads related to branch mispredictions.
+Compared to Xerces and Expat, the data organization of Parabix-XML significantly
+reduces the overall cache miss rate; specifically, there were $7\times$ and $15\times$
+fewer L1 and L2 cache misses compared to the next best parser tested. The improved cache
+utilization helps keep the SIMD units busy by minimizing memory-related stalls
+and lowers the overall energy consumption
+by reducing the need to access the higher levels of the cache hierarchy.
+Using microbenchmarks, we estimated that L1,
+L2, and L3 cache misses consume $\sim$8.3nJ, $\sim$19nJ, and $\sim$40nJ
+respectively. On average, with a 1GB XML file, Expat and Xerces would consume over
 0.6J and 0.9J respectively due to cache misses alone.
 %With a 1GB input file, Expat would consume more than 0.6J and Xercesn
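
To make the arithmetic behind these energy figures explicit, the energy attributable to cache misses for a given input is a weighted sum of the per-level miss counts. The symbols $N_{L1}$, $N_{L2}$, and $N_{L3}$ below are illustrative names for the measured per-file miss counts, not notation used elsewhere in the paper:

\[
E_{\mathrm{miss}} \approx N_{L1} \cdot 8.3\,\mathrm{nJ} \;+\; N_{L2} \cdot 19\,\mathrm{nJ} \;+\; N_{L3} \cdot 40\,\mathrm{nJ}
\]

Applying this sum to the miss counts of Figure \ref{cache_misses}, scaled to a 1GB input, gives the 0.6J (Expat) and 0.9J (Xerces) estimates quoted above.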
     
 \subsection{Branch Mispredictions}
 \label{section:XML-branches}
-In general, reducing the branch misprediction rate is difficult in
-text-based XML parsing applications. This is due to (1) variable
-length nature of the syntactic elements contained within XML
-documents, (2) a data dependent characteristic, and (3) the extensive
-set of syntax constraints imposed by the XML. Traditional
-byte-at-a-time XML parser's performance is limited by the number of
-branch mispredictions.  As shown in Figure \ref{corei3_BR}, Xerces
-averages up to 13 branches per XML byte processed on high density
+In general, the performance of byte-at-a-time XML parsers is limited by branch mispredictions.
+Unfortunately, it is difficult to reduce the branch misprediction rate of
+these parsers due to:
+(1) the variable-length nature of the syntactic elements contained within XML documents;
+(2) the data-dependent nature of the parsing process; and
+(3) the extensive set of syntax constraints imposed by the XML 1.0/1.1 specifications.
+% Branch mispredictions are known
+% to signficantly degrade XML parsing performance in proportion to the markup density of the source document
+% \cite{CameronHerdyLin2008}.
+As shown in Figure \ref{corei3_BR},
+Xerces averages up to 13 branches per XML byte processed on high density
 markup. On modern commodity processors, a single branch
 misprediction incurs a penalty of tens of CPU cycles to restart the processor
-pipeline. The high miss prediction rate in conventional parsers add
-significant overhead. In Parabix the transformation to SIMD operation
-eliminates many branches. Further optimizations take advantage of
-Parabix's data organization and replace condition branches with {\em
-  bit scan} operations that can process up to 128 characters worth of
-branches with one operation. In many cases, we also replace the
-branches with logical predicate operations. Our predicates are cheaper
-to compute since they involve only bit parallel SIMD operations.
-
- As shown in Figure \ref{corei3_BR},
-Parabix processing is almost branch free. Parabix exhibits minimal
-dependence on source XML markup density; it experiences between 19.5 and
-30.7 branch mispredictions per thousand of XML byte. The cost of
+pipeline.
+The high misprediction rate in conventional parsers is a significant overhead.
+In Parabix-XML, the use of SIMD operations eliminates many branches.
+Most conditional branches can be replaced either with
+bitwise operations, which can process up to 128 characters' worth of
+branches in a single operation,
+or with a series of logical predicate operations, which are cheaper
+to compute since they require only SIMD operations.
+
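To make the bit-scan idea concrete, the sketch below is an illustration only and is not taken from the Parabix-XML source. A 128-bit marker stream, modelled here as two 64-bit words with one bit per input character (for example, bit i set when character i is '<'), is walked with bit-scan operations so that work is done only at the marked positions, instead of executing a data-dependent branch for every one of the 128 characters. The helper name for_each_marker is hypothetical, and the GCC-style __builtin_ctzll intrinsic stands in for the processor's bit scan instruction.

\begin{verbatim}
// Illustrative sketch only; not taken from the Parabix-XML source.
#include <cstdint>

// Visit every marked character position in one 128-bit block of a marker
// bit stream (lo covers characters 0..63, hi covers characters 64..127).
template <typename Callback>
void for_each_marker(uint64_t lo, uint64_t hi, int block_base, Callback cb) {
    uint64_t words[2] = {lo, hi};
    for (int w = 0; w < 2; ++w) {
        uint64_t bits = words[w];
        while (bits != 0) {
            int pos = __builtin_ctzll(bits);  // bit scan: index of lowest set bit
            cb(block_base + 64 * w + pos);    // absolute character position
            bits &= bits - 1;                 // clear that bit and continue
        }
    }
}

// Usage: for_each_marker(lt_lo, lt_hi, base,
//                        [](int p) { /* handle '<' at position p */ });
\end{verbatim}

The inner loop branches once per marked position rather than once per character, which is the source of the branch reduction described above.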
+As shown in Figure \ref{corei3_BR},
+Parabix-XML is nearly branch free and exhibits minimal dependence on the
+source markup density. Specifically, it experiences between 19.5 and
+30.7 branch mispredictions per kB of XML data. Conversely, the cost of
 branch mispredictions for the Expat parser can be over 7 cycles per
-XML byte (see Figure \ref{corei3_BM}) ---this cost alone is higher
-than the average latency of a byte processed by Parabix.
+XML byte (see Figure \ref{corei3_BM}) --- which exceeds
+the average latency of a byte processed by Parabix-XML.


     

 \begin{figure}
-\subfigure[Branch Instructions]{
+\subfigure[Branch Instructions / kB]{
 \includegraphics[width=0.5\textwidth]{plots/corei3_BR.pdf}
 \label{corei3_BR}
 }
 \hfill
-\subfigure[Branch Misses]{
+\subfigure[Branch Misses / kB]{
 \includegraphics[width=0.5\textwidth]{plots/corei3_BM.pdf}
 \label{corei3_BM}
     
 \subsection{SIMD Instructions vs. Total Instructions}

-In Parabix, bit streams are both computed and
-predominately operated upon using the SIMD instructions of commodity
-processors.  The ratio of retired SIMD instructions to total
+In Parabix-XML, the ratio of retired SIMD instructions to total
 instructions provides insight into the relative degree to which
-Parabix achieves parallelism over the byte-at-a-time approach.
-
-
-Using the Intel Pin tool, we gather the dynamic instruction mix for
-each XML workload, and classify instructions as either vector (SIMD)
-or non-vector instructions.  Figure~\ref{corei3_INS_p2} shows the
-percentage of SIMD instructions for the Parabix XML parser. The ratio of executed
-SIMD instructions over total instructions indicates the amount of
-parallel processing we were able to extract.
-%(Expat and Xerce do not use any SIMD instructions)
-The Parabix instruction mix is made up of 60\% to 80\% SIMD
-instructions.  The markup density of the files influence the number of
+Parabix-XML achieves parallelism over the byte-at-a-time approach.
+Using the Intel Pin tool, we gathered the dynamic instruction mix for
+each XML workload and classified the instructions as either SIMD
+or non-SIMD.  Figure~\ref{corei3_INS_p2} shows the
+percentage of SIMD instructions in the Parabix-XML parser.
+The ratio of executed SIMD instructions over total instructions indicates
+the amount of available parallelism.
+The resulting instruction mix consists of 60\% to 80\% SIMD
+instructions. The markup density of the files influences the number of
 scalar instructions needed to handle the tag processing, which affects
 the overall parallelism that can be extracted by Parabix.  We find
     
 %included in this portion of the study.

-% Parabix gains its performance by using parallel bitstreams, which
+% Parabix gains its performance by using parallel bit streams, which
 % are mostly generated and calculated by SIMD instructions.  We use Intel
 % pin, a dynamic binary instrumentation tool, to gather instruction
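
The instruction-mix classification described above can be reproduced with a small Pin tool in the style of Pin's standard instruction-counting examples. The following is a hypothetical sketch rather than the instrumentation actually used for the measurements; in particular, testing only XED_CATEGORY_SSE is a simplifying assumption, and a complete tool would count every vector instruction category of interest.

\begin{verbatim}
// Hypothetical Pin tool sketch: count retired SIMD vs. total instructions.
// Not the instrumentation used for the paper's measurements.
#include "pin.H"
#include <iostream>

static UINT64 totalCount = 0;
static UINT64 simdCount  = 0;

// Analysis routine: called before every executed instruction.
VOID CountIns(UINT32 isSimd) {
    totalCount++;
    simdCount += isSimd;
}

// Instrumentation routine: classify each instruction once, at insertion time.
VOID Instruction(INS ins, VOID *v) {
    // Simplifying assumption: treat the SSE category as "SIMD"; a complete
    // tool would also test MMX/AVX and other vector categories of interest.
    UINT32 isSimd = (INS_Category(ins) == XED_CATEGORY_SSE) ? 1 : 0;
    INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)CountIns,
                   IARG_UINT32, isSimd, IARG_END);
}

VOID Fini(INT32 code, VOID *v) {
    std::cerr << "SIMD: " << simdCount << " of " << totalCount
              << " instructions" << std::endl;
}

int main(int argc, char *argv[]) {
    if (PIN_Init(argc, argv)) return 1;
    INS_AddInstrumentFunction(Instruction, 0);
    PIN_AddFiniFunction(Fini, 0);
    PIN_StartProgram();   // never returns
    return 0;
}
\end{verbatim}

Run under Pin against each parser binary and workload, a tool of this kind would produce a SIMD-to-total ratio of the sort plotted in Figure~\ref{corei3_INS_p2}.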
     
 \subsection{CPU Cycles}

-Figure \ref{corei3_TOT} shows overall parser performance evaluated in
-terms of CPU cycles per kilobyte.  The Parabix parser  is 2.5$\times$
-to 4$\times$ faster on document-oriented input and 4.5 to 7 times faster
+Figure \ref{corei3_TOT} shows overall parser performance in
+terms of CPU cycles per kB. Parabix-XML is 2.5$\times$
+to 4$\times$ faster on document-oriented input and 4.5$\times$ to 7$\times$ faster
 on data-oriented input.  Traditional parsers can be dramatically
-slowed by dense markup, while Parabix is affected much less.  The
-results presented are not entirely fair to the Xerces parser since it
-first transcodes input from UTF-8 to UTF-16 before processing. In
-Xerces, this transcoding requires several cycles per byte.  However,
+slowed by dense markup, but Parabix-XML is relatively unaffected.
+Unlike Parabix-XML and Expat, Xerces transcodes input from UTF-8 to UTF-16 before
+processing it; this transcoding requires several cycles per byte. However,
 transcoding using parallel bit streams is significantly faster and
 requires less than a single cycle per byte.
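
For context on why a conventional transcoding step costs several cycles per byte, the sketch below shows the byte-at-a-time UTF-8 to UTF-16 loop that any scalar transcoder must perform. It is an illustration only, restricted to the basic multilingual plane with error handling omitted, and the function name transcode_bmp is hypothetical; it is not Xerces' implementation. Every input byte triggers one or more data-dependent branches, precisely the pattern the parallel bit stream transcoder avoids.

\begin{verbatim}
// Hypothetical byte-at-a-time UTF-8 to UTF-16 transcoder (BMP only, no
// error handling); shown to illustrate per-byte, data-dependent branching.
#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<uint16_t> transcode_bmp(const uint8_t *s, std::size_t n) {
    std::vector<uint16_t> out;
    out.reserve(n);
    std::size_t i = 0;
    while (i < n) {
        uint8_t b = s[i];
        if (b < 0x80) {                          // 1-byte (ASCII) sequence
            out.push_back(b);
            i += 1;
        } else if ((b & 0xE0) == 0xC0) {         // 2-byte sequence
            out.push_back(static_cast<uint16_t>(
                ((b & 0x1F) << 6) | (s[i + 1] & 0x3F)));
            i += 2;
        } else if ((b & 0xF0) == 0xE0) {         // 3-byte sequence
            out.push_back(static_cast<uint16_t>(
                ((b & 0x0F) << 12) | ((s[i + 1] & 0x3F) << 6) | (s[i + 2] & 0x3F)));
            i += 3;
        } else {                                 // 4-byte sequences (surrogate
            i += 4;                              // pair output) omitted here
        }
    }
    return out;
}
\end{verbatim}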
     

 \subsection{Power and Energy}
-In this section, we study the power and energy consumption of Parabix
-in comparison with Expat and Xerces on \CITHREE{}. The average power
-of \CITHREE\ is about 21 watts. Figure \ref{corei3_power} shows the
-average power consumed by each parser.  Parabix, dominated by SIMD
-instructions which uses approximately 5\% additional power. While the
+In this section, we study the power and energy consumption of Parabix-XML
+in comparison with Expat and Xerces on the \CITHREE{}.
+Figure \ref{corei3_power} shows the
+average power consumed by each parser. Parabix-XML, dominated by SIMD
+instructions, uses $\sim5\%$ additional power. While the
 SIMD functional units are significantly wider than their scalar
-counterparts; register width and functional unit power account only
+counterparts, register width and functional unit power account only
 for a small fraction of the overall power consumption in a processor
 pipeline. More importantly, by using data parallel operations, Parabix