Changeset 1380


Timestamp:
Aug 25, 2011, 1:56:51 PM
Author:
ashriram
Message:

Done evaluation

Location:
docs/HPCA2012
Files:
6 edited

  • docs/HPCA2012/03b-research.tex

    r1372 r1380  
    1111
    1212
    13 Figure \ref{parabix_arch} shows the overall structure of the Parabix XML parser set up for
    14 well-formedness checking.
    15 The input file is processed using 11 functions organized into 7 modules. 
    16 In the first module, the Read\_Data function loads data blocks from an input file to data\_buffer.
    17 The data is then transposed to eight parallel basis bitstreams (basis\_bits) in the Transposition module.
    18 The eight bitstreams are used in the Classification function to generate all the XML lexical item streams (lex)
    19 as well as in the U8\_Validation module to validate UTF-8 characters.
    20 The lexical item streams and scope streams (scope) that are generated in Gen\_Scope function
    21 are supplied to the parsing module, which consists three functions, Parse\_CtCDPI, Parse\_Ref and Parse\_tag.
    22 These functions deal with the parsing of
    23 comments, CDATA sections, processing instructions, references and tags.   After this,
    24 information is gathered by Name\_Validation and Err\_Check functions, producing
    25 name check streams and error streams.  These are then passed to the final module for Postprocessing.
    26 All the possible errors that cannot be conveniently detected by bitstreams are checked in this last module.
    27 The final output reports any well-formedness error detected and its position within the input file.
     13Figure \ref{parabix_arch} shows the overall structure of the Parabix
     14XML parser set up for well-formedness checking.  The input file is
     15processed using 11 functions organized into 7 modules.  In the first
     16module, the Read\_Data function loads data blocks from an input file
     17to data\_buffer.  The data is then transposed to eight parallel basis
     18bitstreams (basis\_bits) in the Transposition module.  The eight
     19bitstreams are used in the Classification function to generate all the
     20XML lexical item streams (lex) as well as in the U8\_Validation module
     21to validate UTF-8 characters.  The lexical item streams and scope
     22streams (scope) that are generated in the Gen\_Scope function are supplied
     23to the parsing module, which consists of three functions: Parse\_CtCDPI,
     24Parse\_Ref and Parse\_tag.  These functions deal with the parsing of
     25comments, CDATA sections, processing instructions, references and
     26tags.  After this, information is gathered by the Name\_Validation and
     27Err\_Check functions, producing name check streams and error streams.
     28These are then passed to the final module for Postprocessing.  All the
     29possible errors that cannot be conveniently detected by bitstreams are
     30checked in this last module.  The final output reports any
     31well-formedness error detected and its position within the input file.
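
To make the transposition step concrete, the following is a minimal scalar sketch of the idea (not the actual Parabix run-time code, which is SIMD-based; the block size, bit ordering, and names here are illustrative assumptions):

    /* Illustrative sketch only: scalar transposition of a block of input
       bytes into eight parallel basis bitstreams.  The real Transposition
       module uses SIMD instructions from the Parabix run-time library. */
    #include <stdint.h>
    #include <stddef.h>

    #define BLOCK_SIZE 64   /* one 64-bit word per basis bitstream */

    /* basis_bits[k] receives bit k (counting from the most significant bit)
       of every byte in the block. */
    static void transpose_block(const uint8_t data[BLOCK_SIZE],
                                uint64_t basis_bits[8]) {
        for (int k = 0; k < 8; k++) basis_bits[k] = 0;
        for (size_t i = 0; i < BLOCK_SIZE; i++) {
            for (int k = 0; k < 8; k++) {
                uint64_t bit = (uint64_t)(data[i] >> (7 - k)) & 1u;
                basis_bits[k] |= bit << i;    /* position i of stream k */
            }
        }
    }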
    2832
    29 Within this structure, all functions in the four shaded modules consist entirely of parallel bit stream
    30 operations.  Of these, the Classification function consists of XML character class definitions that
    31 are generated using ccc, while much of the U8\_Validation similarly consists of UTF-8 byte class
    32 definitions that are also generated by ccc.  The remainder of these functions are programmed using
    33 our unbounded bitstream language following the logical requirements of XML parsing.   All the functions
    34 in the four shaded modules are then compiled to low-level C/C++ code using our Pablo compiler.   This
    35 code is then linked in with the general Transposition code available in the Parabix run-time library,
    36 as well as the hand-written Postprocessing code that completes the well-formed checking.
     33Within this structure, all functions in the four shaded modules
     34consist entirely of parallel bit stream operations.  Of these, the
     35Classification function consists of XML character class definitions
     36that are generated using ccc, while much of the U8\_Validation
     37similarly consists of UTF-8 byte class definitions that are also
     38generated by ccc.  The remainder of these functions are programmed
     39using our unbounded bitstream language following the logical
     40requirements of XML parsing.  All the functions in the four shaded
     41modules are then compiled to low-level C/C++ code using our Pablo
     42compiler.  This code is then linked in with the general Transposition
     43code available in the Parabix run-time library, as well as the
     44hand-written Postprocessing code that completes the well-formedness
     45checking.
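
For illustration, the kind of bitwise character-class logic that ccc generates can be sketched as follows for the '<' character (0x3C). This is a hand-written approximation, not actual ccc output; it operates on 64-bit words rather than SIMD registers, and the bit-numbering convention is assumed to match the transposition sketch above:

    /* Illustrative sketch: mark every byte equal to '<' (0x3C = 00111100)
       with a 1 bit, given the eight basis bitstreams b[0..7], where b[0]
       holds the most significant bit of each byte (assumed convention). */
    static uint64_t classify_lt(const uint64_t b[8]) {
        return ~b[0] & ~b[1] & b[2] & b[3] & b[4] & b[5] & ~b[6] & ~b[7];
    }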
  • docs/HPCA2012/05-corei3.tex

    r1378 r1380  
    147147requires less than a single cycle per byte.
    148148
    149 \begin{figure}[b]
    150 \subfigure[Instruction Breakdown (\% SIMD Instructions)]{
    151 \includegraphics[width=0.5\textwidth]{plots/corei3_INS_p2.pdf}
     149\begin{figure}[htbp]
     150\begin{minipage}{0.5\linewidth}
     151\centering
     152\includegraphics[width=\textwidth]{plots/corei3_INS_p2.pdf}
     153\caption{Instruction Breakdown (\% SIMD Instructions)}
    152154\label{corei3_INS_p2}
    153 }
     155\end{minipage}%
    154156\hfill
    155 \subfigure[Performance (CPU Cycles per kB)]{
    156 \includegraphics[width=0.5\textwidth]{plots/corei3_TOT.pdf}
     157\begin{minipage}{0.5\linewidth}
     158\centering
     159\includegraphics[width=\textwidth]{plots/corei3_TOT.pdf}
     160\caption{Performance (CPU Cycles per kB)}
    157161\label{corei3_TOT}
    158 }
    159 \end{figure}
     162\end{minipage}
     163\end{figure}
     164
    160165
    161166
    162167\subsection{Power and Energy}
    163 In this section, we study the power and energy consumption of Parabix in
    164 comparison with Expat and Xerces on \CITHREE{}. The average power of
    165 \CITHREE\ is about 21 watts. Figure \ref{corei3_power} shows the
     168In this section, we study the power and energy consumption of Parabix
     169in comparison with Expat and Xerces on \CITHREE{}. The average power
     170of \CITHREE\ is about 21 watts. Figure \ref{corei3_power} shows the
    166171average power consumed by each parser.  Parabix is dominated by SIMD
    167172instructions, which use approximately 5\% additional power. While the
     
    171176pipeline. More importantly, by using data-parallel operations Parabix
    172177amortizes the fetch and data access overheads. This results in minimal
    173 power increase compared to the conventional parsers. 
    174 Perhaps the energy trends shown in Figure
    175 \ref{corei3_energy} reveal an interesting trend. Parabix consumes
    176 substantially less energy than the other parsers. Parabix consumes 50
    177 to 75 nJ per byte while Expat and Xerces consume 80nJ to 320nJ and
    178 140nJ to 370nJ per byte respectively.  Although Parabix
    179 requires slightly more power (per instruction), the processing time of
    180 Parabix is significantly lower.
     178power increase compared to the conventional parsers.  The
     179energy trends shown in Figure \ref{corei3_energy} are more
     180revealing: Parabix consumes substantially less energy than the
     181other parsers, 50 to 75 nJ per byte, while Expat and
     182Xerces consume 80 nJ to 320 nJ and 140 nJ to 370 nJ per byte respectively.
     183Although Parabix requires slightly more power (per instruction), its
     184processing time is significantly lower.
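
As a rough back-of-the-envelope check (assuming the roughly 21 W average package power applies to all three parsers, which is approximately true given the few-percent differences reported above), energy per byte is simply average power times processing time per byte, so the energy gap follows directly from the gap in processing time:

\[ E_{\mathrm{byte}} = P_{\mathrm{avg}} \times t_{\mathrm{byte}}
   \quad\Rightarrow\quad
   t_{\mathrm{byte}} \approx \frac{50\,\mathrm{nJ}}{21\,\mathrm{W}} \approx 2.4\,\mathrm{ns} \]

for Parabix at the low end, versus roughly $80\,\mathrm{nJ} / 21\,\mathrm{W} \approx 3.8\,\mathrm{ns}$ for Expat's best case.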
    181185
    182186
     
    195199\label{corei3_energy}
    196200}
    197 \end{figure}
    198 
    199 
     201\caption{Power and energy profile of Parabix on \CITHREE{}}
     202\end{figure}
     203
     204
  • docs/HPCA2012/06-scalability.tex

    r1370 r1380  
    1 \section{Scalability}
     1\section{Parabix on Various Hardware}
    22\label{section:scalability}
    33\subsection{Performance}
    4 Figure \ref{Scalability} (a) demonstrates the average XML
    5 well-formedness checking performance of Parabix2 for each of the
    6 workloads and as executed on each of the processor cores --- \CO\,
    7 \CITHREE\ and \SB{}.  Processing time is shown in terms of bit stream
    8 based operations executed in `bit-space' and postprocessing operations
    9 executed in `byte-space'.  In the Parabix2 parser, bit-space parallel
    10 bit stream parser operations consist primarily of SIMD instructions;
    11 byte-space operations consist of byte comparisons across arrays of
    12 values. Executing Parabix2 on \CITHREE{} over \CO\ results in an
    13 average performance improvement of 17\% in bit stream processing
    14 whereas migrating Parabix2 from \CITHREE{} to \SB{} results in a 22\%
    15 average performance gain. Bit space measurements are stable and
    16 consistent across each of the source inputs and cores. Postprocessing
    17 operations demonstrate data dependent variance. Performance gains from
    18 18\% to 31\% performance are observered in migrating Parabix2 from
    19 \CO\ to \CITHREE{}; 0\% to 17\% performance from \CITHREE\ to
    20 \SB{}. For the purpose of comparison, Figure \ref{Scalability} (b)
    21 shows the performance of the Expat parser on each of the processor
    22 cores.  A performance improvement of less than 5\% is observed when
    23 executing Expat on \CITHREE\ over \CO\ and less than 10\% on \SB\ over
    24 \CITHREE{}.
     4In this section, we study the performance of the XML parsers across
     5three generations of Intel architectures.  Figure \ref{Scalability}
     6(a) shows the average execution time of Parabix.  We analyze the
     7execution time in terms of SIMD operations that operate on bitstreams
     8(\textit{bit-space}) and scalar operations that perform
     9postprocessing on the original character bytes.  In Parabix, a significant
     10fraction of the overall execution time is spent in SIMD operations. 
    2511
    26 Overall, Parabix2 scales better than Expat. Simply executing identical
    27 Parabix2 object code on \SB\ results in an overall performance
    28 improvement up to 26\%. Additional performance aspects of Parabix2 on
    29 \SB\ with AVX instructions are discussed in the following sections.
     12Our results demonstrate that Parabix's optimizations are complementary
     13to hardware improvements and seem to further improve the efficiency of
     14newer microarchitectures.  For Parabix's bit-stream processing,
     15\CITHREE{} results in a 40\% performance improvement over \CO{},
     16whereas \SB{} results in a 20\% improvement compared to
     17\CITHREE{}. The improvements in the bit-space SIMD operations are
     18stable across the different input files. Postprocessing operations
     19demonstrate data-dependent variance. \CITHREE{} gains between
     2027\% and 40\% compared to \CO{} and \SB{} gains between 16\% and 39\%
     21compared to \CITHREE{}. For the purpose of comparison, Figure
     22\ref{Scalability} (b) shows the performance of the Expat parser;
     23\CITHREE\ improves performance by only 5\% over \CO\ while \SB\
     24improves performance by less than 10\% over \CITHREE{}. Note that the
     25gains of \CITHREE\ over \CO\ include improvements in both clock
     26frequency and microarchitecture, while \SB{}'s gains can
     27be attributed mainly to microarchitectural improvements.
     28
     29Figure \ref{power_Parabix2} shows the average power consumption of
     30Parabix over each workload as executed on each of the processor
     31cores --- \CO{}, \CITHREE\ and \SB{}.  Each of the last three
     32generations of processors seems to bring a 25--30\% improvement
     33in power consumption over its predecessor. Parabix on \SB\ consumes
     34less than 15 W.  Overall, Parabix on \SB\ consumes 72\% to 75\% less
     35energy than on \CO{}.
     36
    3037
    3138\begin{figure}
     
    4148\end{figure}
    4249
    43 
    44 \subsection{Power and Energy}
    45 
    46 Figure \ref{power_Parabix2} shows the average power consumption of
    47 Parabix2 over each workload and as executed on each of the processor
    48 cores --- \CO{}, \CITHREE\ and \SB{}.  Average power consumption on
    49 \CO{} is 32 watts. Execution on \CITHREE\ results in 30\% power saving
    50 over \CO{}.  \SB\ saves 25\% of the power compared with \CITHREE\ and
    51 consumes only 15 watts.
    52 
    53 In XML parsing we observe energy consumption is dependent on processing time. That is, a reduction in processing time results in a directly proportional reduction in energy consumption.
    54 With newer processor cores comes improvements in application performance. As a result, Parabix2 executed on \SB\ consumes 72\% to 75\% less energy than Parabix2 on \CO{}.
    55 
    56 
    57 
    58 
    5950\begin{figure}
    6051\centering
     
    6960\label{energy_Parabix2}
    7061}
     62\caption{Energy Profile of Parabix on various hardware platforms}
    7163\end{figure}
     64
     65
     66\def\CORTEXA8{Cortex-A8}
     67
     68\subsection{Parabix on Mobile Processors}
     69\label{section:neon}
     70Our experience with successive generations of Intel processors led us to
     71consider mobile processors such as the ARM \CORTEXA8{}, which
     72also includes SIMD units.  ARM NEON provides a 128-bit SIMD
     73instruction set similar in functionality to the Intel SSE3 instruction
     74set. In this section, we present our performance comparison of a
     75NEON-based port of Parabix versus the Expat parser. Xerces is excluded
     76from this portion of our study due to the complexity of the
     77cross-platform build process for C++ applications.
     78
     79The platform we use is the Samsung Galaxy Android Tablet that houses a
     80Samsung S5PC110 ARM \CORTEXA8{} 1 GHz single-core, dual-issue,
     81superscalar microprocessor. It includes a 32 kB L1 data cache and a
     82512 kB shared L2 cache.  Migration of Parabix to the Android platform
     83began with the retargeting of a subset of the Parabix SIMD library
     84for ARM NEON.  The majority of the Parabix SIMD functionality ported
     85directly. However, for a small subset of the SIMD functions (e.g., bit
     86packing), NEON equivalents did not exist. In such cases we
     87emulated logically equivalent operations using the available
     88scalar instruction set. This library code was cross-compiled for
     89Android using the Android NDK.
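
As an illustration of such an emulation (a hypothetical sketch, not the actual Parabix library code), a portable scalar fallback for an SSE movemask-style operation, which gathers one bit per byte and has no direct single-instruction NEON equivalent, might look like this:

    /* Illustrative sketch only: scalar emulation of an SSE movemask-style
       operation (collect the high bit of each of 16 bytes into a 16-bit
       mask).  Function name and interface are hypothetical. */
    #include <stdint.h>

    static uint16_t movemask_epi8_scalar(const uint8_t v[16]) {
        uint16_t mask = 0;
        for (int i = 0; i < 16; i++) {
            mask |= (uint16_t)((v[i] >> 7) & 1u) << i;  /* high bit of byte i */
        }
        return mask;
    }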
     90
     91A comparison of Figure \ref{arm_processing_time} and Figure
     92\ref{corei3_TOT} demonstrates that the performance of both Parabix and
     93Expat degrades substantially on \CORTEXA8{} (?$\times$---?$\times$).
     94This result was expected given the comparatively limited performance of the
     95\CORTEXA8{}.  Surprisingly, on \CORTEXA8{}, Expat outperforms Parabix
     96on each of the lower markup density workloads, dew.xml and jaw.xml. On
     97the remaining higher-density workloads, Parabix performs only
     98moderately better than Expat.  Investigating the causes of this
     99performance degradation for Parabix led us to examine the latency
     100of NEON SIMD operations.
     101
     102
     103
     104Figure \ref{relative_performance_arm} investigates the performance of
     105Expat and Parabix for the various input workloads on the \CORTEXA8{};
     106Figure~\ref{relative_performance_intel} plots the performance for
     107\CITHREE{}. The results demonstrate that the execution time of
     108each parser varies in a linear fashion with respect to the markup
     109density of the file. On both the \CORTEXA8{} and \CITHREE{}, both
     110parsers demonstrate the same trend. For lower markup density files
     111for which the fraction of SIMD operations and hence the potential for
     112parallelism is limited, the overheads of SIMD instructions affect
     113overall execution time. Figure~\ref{relative_performance_arm} provides
     114insight into the problem: Parabix's performance is hindered by SIMD
     115instruction latency for low markup density files; it appears that the
     116latency of SIMD operations is relatively higher on the \CORTEXA8{}
     117processor.  This is possibly because the NEON SIMD extensions are
     118implemented as a coprocessor on \CORTEXA8{}, which imposes higher
     119overhead for applications that frequently inter-operate between scalar
     120and SIMD registers. Future enhancements to ARM NEON that
     121implement the SIMD unit within the core microarchitecture could
     122substantially improve the efficiency of Parabix.
     123
     124
     125\begin{figure}
     126\subfigure[ARM Neon Performance]{
     127\includegraphics[width=0.3\textwidth]{plots/arm_TOT.pdf}
     128\label{arm_processing_time}
     129}
     130\hfill
     131\subfigure[ARM Neon]{
     132\includegraphics[width=0.32\textwidth]{plots/Markup_density_Arm.pdf}
     133\label{relative_performance_arm}
     134}
     135\hfill
     136\subfigure[Core i3]{
     137\includegraphics[width=0.32\textwidth]{plots/Markup_density_Intel.pdf}
     138\label{relative_performance_intel}
     139}
     140\caption{Parabix performance on mobile platforms}
     141\end{figure}
     142
     143
     144
  • docs/HPCA2012/09-pipeline.tex

    r1362 r1380  
    2929
    3030We adopt a contrasting approach to parallelizing the Parabix XML
    31 parser.  As described in Section~\ref{} Parabix consists of multiple
     31parser.  As described in Section~\ref{section:parser}, Parabix consists of multiple
    3232passes that operate on every chunk of input data and each of these stages
    3333interact in sequence with no data movement from later to earlier
  • docs/HPCA2012/main.tex

    r1363 r1380  
    177177\input{06-scalability.tex}
    178178\input{07-avx.tex}
    179 \input{08-arm.tex}
     179%\input{08-arm.tex}
    180180\input{09-pipeline.tex}
    181181\input{10-related.tex}