Changeset 1362 for docs


Timestamp:
Aug 23, 2011, 6:38:17 PM (8 years ago)
Author:
ashriram
Message:

first parse. Pipeline stage

Location:
docs/HPCA2012
Files:
2 edited

  • docs/HPCA2012/04-methodology.tex

    r1339 r1362  
    1010
    1111To begin our study we propose to first investigate each of the XML
    12 parsers in terms of the Performance Monitoring Counter \footnote{Performance Monitoring Counters
    13  are special-purpose registers available with most modern
    14  microprocessors. PMCs store the running count of specific hardware
    15  events, such as retired instructions, cache misses, branch
    16  mispredictions, and arithmetic-logic unit operations.
    17  PMCs can be used to capture information about any program at
    18  run-time and under any workload at a fine granularity.} (PMC) hardware events listed in
    19 the PMC Hardware Events subsection. Based on the findings of previous
    20 work \cite{bellosa2001, bertran2010, bircher2007} we have chosen
    21 several key hardware performance events for which the authors indicate
    22 a strong correlation with energy consumption. In addition, we measure
    23 the runtime counts of SIMD instructions and
    24 bitwise operations using the Intel Pin binary instrumentation
    25 framework. Based on these data we gain further insight into XML
    26 parser execution characteristics and compare and constrast each of the Parabix parser versions
    27 against the performance of standard industry parsers.
     12parsers in terms of the Performance Monitoring Counter (PMC) hardware
     13events listed in the PMC Hardware Events subsection. Based on the
     14findings of previous work \cite{bellosa2001, bertran2010, bircher2007}
     15we have chosen several key hardware performance events for which the
     16authors indicate a strong correlation with overall performance and
     17energy consumption of the application. In addition, we measure the
     18runtime counts of SIMD instructions and bitwise operations using the
     19Intel Pin binary instrumentation framework. Based on these data we
     20gain further insight into XML parser execution characteristics and
     21compare and contrast the performance of each Parabix parser version
     22against that of standard industry parsers.
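
For illustration, the sketch below shows one way such runtime counts can be gathered with a Pin tool, following the structure of Pin's standard instruction-counting examples. The category tests (XED_CATEGORY_SSE, XED_CATEGORY_LOGICAL) and the tool itself are assumptions for the sketch, not necessarily the exact instrumentation used for the measurements reported in the paper.

\begin{verbatim}
// simd_count.cpp -- illustrative Pin tool: counts executed SSE and
// bitwise-logical instructions (category choices are assumptions).
#include "pin.H"
#include <iostream>

static UINT64 simdCount = 0;
static UINT64 logicalCount = 0;

VOID CountSimd()    { simdCount++; }
VOID CountLogical() { logicalCount++; }

VOID Instruction(INS ins, VOID *v)
{
    INT32 cat = INS_Category(ins);
    if (cat == XED_CATEGORY_SSE)            // SIMD instructions
        INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)CountSimd, IARG_END);
    else if (cat == XED_CATEGORY_LOGICAL)   // bitwise AND/OR/XOR/...
        INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)CountLogical, IARG_END);
}

VOID Fini(INT32 code, VOID *v)
{
    std::cerr << "SIMD: " << simdCount
              << "  logical: " << logicalCount << std::endl;
}

int main(int argc, char *argv[])
{
    if (PIN_Init(argc, argv)) return 1;
    INS_AddInstrumentFunction(Instruction, 0);
    PIN_AddFiniFunction(Fini, 0);
    PIN_StartProgram();   // never returns
    return 0;
}
\end{verbatim}

Such a tool is attached to a parser run with Pin's usual launcher, e.g. pin -t simd_count.so -- ./parser input.xml.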
    2823
    2924The foundational work by Bellosa in \cite{bellosa2001} as well as more
     
    4439\subsection{Parsers}\label{parsers}
    4540
    46 The XML parsing technologies selected for this study are the Parabix1, Parabix2,
    47 Xerces-C++, and Expat XML parsers. Parabix1 (parallel bit Streams for XML) is our first generation SIMD and Parallel Bit Stream technology based XML parser \cite{Parabix1}.
    48 Parabix1 leverages the processor built-in {\em bitscan} operation for high-performance XML character scanning as well as the
    49 SIMD capabilities of modern commodity processors to achieve high performance.
    50 Parabix2 \cite{parabix2} represents the second generation of the Parabix1 parser. Parabix2
    51 is an open-source XML parser that also leverages Parallel Bit Stream technology and the SIMD capabilities of
    52 modern commodity processors. However, Parabix2 differs from Parabix1 in that it employs new parallelization
    53 techniques, such as a multiple cursor approach to parallel parsing together with bit stream addition techniques to advance multiple cursors independently and in parallel. Parabix2 delivers
    54 dramatic performance improvements over traditional byte-at-a-time
    55 parsing technology.  Xerces-C++ version 3.1.1 (SAX) \cite{xerces} is a
    56 validating open source XML parser written in C++ by the Apache
    57 project.  Expat version 2.0.1 \cite{expat} is a non-validating XML
    58 parser library written in C.
     41The XML parsing technologies selected for this study are the
     42Parabix1, Parabix2, Xerces-C++, and Expat XML parsers. Parabix1
     43(Parallel Bit Streams for XML) is our first-generation XML parser
     44based on SIMD and Parallel Bit Stream technology \cite{Parabix1}.
     45It leverages the processor's built-in {\em bitscan} operation
     46together with the SIMD capabilities of modern commodity processors
     47for high-performance XML character scanning. Parabix2 \cite{parabix2}
     48is the second generation of the Parabix parser. It is an open-source
     49XML parser that also builds on Parallel Bit Stream technology and
     50the SIMD capabilities of modern commodity processors. However,
     51Parabix2 differs from Parabix1 in that it employs new parallelization
     52techniques, such as a multiple-cursor approach to parallel parsing
     53together with bit stream addition techniques that advance multiple
     54cursors independently and in parallel. Parabix2 delivers dramatic
     55performance improvements over traditional byte-at-a-time parsing
     56technology. Xerces-C++ version 3.1.1 (SAX) \cite{xerces} is a
     57validating open-source XML parser written in C++ by the Apache
     58project. Expat version 2.0.1 \cite{expat} is a non-validating XML
     59parser library written in C.
    5960
    6061\begin{table*}
  • docs/HPCA2012/09-pipeline.tex

    r1359 r1362  
    1 \section{Multi-threaded Parabix}
     1\section{Multithreaded Parabix}
    22\label{section:multithread}
    3 The general problem of addressing performance through multicore parallelism
    4 is the increasing energy cost. As discussed in previous sections,
    5 Parabix, which applies SIMD-based techniques can not only achieves better performance but consumes less energy.
    6 Moreover, using mulitiple cores, we can further improve the performance of Parabix while keeping the energy consumption at the same level.
      3The general challenge with boosting performance through multicore
      4parallelism is the associated power and energy overhead. Putting more
      5cores to work on a chip results in a proportionate increase in
      6power. Unfortunately, the runtime overheads of thread management and
      7data synchronization make it hard to obtain corresponding
      8improvements in performance, resulting in increased energy costs.
      9Parabix, which exploits fine-grained SIMD parallelism within a core,
     10improves performance and achieves corresponding improvements in
     11energy by raising overall compute efficiency. However, Parabix is
     12restricted to the resources available within a single core. In this
     13section we parallelize the Parabix XML parser by hand to study the
     14effects of thread-level parallelism in conjunction with Parabix's
     15data parallelism.
    716
    8 A typical approach to parallelizing software, data parallelism,
    9 requires nearly independent data, However, the nature of XML files
    10 makes them hard to partition nicely for data parallelism.  Several
    11 approaches have been used to address this problem.  A preparsing phase
    12 has been proposed to help partition the XML document
    13 \cite{dataparallel}.  The goal of this preparsing is to determine the
    14 tree structure of the XML document so that it can be used to guide the
    15 full parsing in the next phase.  Another data parallel algorithm is
    16 called ParDOM \cite{Shah:2009}.  It first builds partial DOM node tree
    17 structures for each data segments and then link them using preorder
    18 numbers that has been assigned to each start element to determine the
    19 ordering among siblings and a stack to manage the parent-child
    20 relationship between elements.
    2117
    22 Data parallelism approaches introduce a lot of overheads to solve the
    23 data dependencies between segments.  Therefore, instead of
    24 partitioning the data into segments and assigning different data
    25 segments to different cores, we propose a pipeline parallelism
    26 strategy that partitions the process into several stages and let each
    27 core work with one single stage.
     18A typical approach to handling data parallelism with multiple threads
     19requires partitioning the data uniformly across the threads. However,
     20XML parsing inherently moves through the data in sequence, which
     21creates data dependencies between the different phases and makes it
     22hard to partition the file. A few attempts have been made to address
     23this problem using a pre-parsing phase to help determine the tree
     24structure and to partition the XML document~\cite{dataparallel}.
     25There have been other approaches that speculatively partition the
     26data~\cite{Shah:2009}, but these introduce significant complexity
     27into the overall logic of the program.
    2828
    29 The interface between stages is implemented using a circular array,
    30 where each entry consists of all ten data structures for one segment
    31 as listed in Table \ref{pass_structure}.  Each thread keeps an index
    32 of the array ($I_N$), which is compared with the index ($I_{N-1}$)
    33 kept by its previous thread before processing the segment.  If $I_N$
    34 is smaller than $I_{N-1}$, thread N can start processing segment
    35 $I_N$, otherwise the thread keeps reading $I_{N-1}$ until $I_{N-1}$ is
    36 larger than $I_N$.  The time consumed by continuously loading the
    37 value of $I_{N-1}$ and comparing it with $I_N$ will be later referred
    38 as stall time.  When a thread finishes processing the segment, it
    39 increases the index by one.
     29
     30We adopt a contrasting approach to parallelizing the Parabix XML
     31parser. As described in Section~\ref{}, Parabix consists of multiple
     32passes that operate on every chunk of input data; these passes
     33interact in sequence, with no data movement from later to earlier
     34passes. This fits well into the mold of pipeline parallelism, so we
     35partition the overall parser into several stages and assign a core to
     36each stage. One of the key challenges is to decide which passes
     37should be grouped into a stage. We analyzed the latency of each pass
     38in the single-threaded version of the Parabix XML parser (Column 1 in
     39Table~\ref{pass_structure}) and assigned the passes to stages so that
     40the stage latencies of the overall pipeline are balanced, yielding
     41maximal throughput.
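
As a rough illustration of this balancing step, the sketch below groups consecutive passes into a fixed number of stages so that the largest per-stage latency is minimized. The latency values and stage count are hypothetical placeholders rather than the measured figures of Table~\ref{pass_structure}; the actual assignment in Parabix was done by hand.

\begin{verbatim}
// Illustrative only: choose a contiguous grouping of passes into
// pipeline stages that minimizes the largest per-stage latency
// (binary search on the latency bound).  Latencies are placeholders.
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

// Can the passes be split into at most 'stages' contiguous groups,
// each with total latency <= limit?
static bool feasible(const std::vector<double>& lat, int stages, double limit) {
    int used = 1;
    double cur = 0.0;
    for (double l : lat) {
        if (l > limit) return false;
        if (cur + l > limit) { ++used; cur = l; }
        else                 { cur += l; }
    }
    return used <= stages;
}

int main() {
    // hypothetical per-pass latencies (e.g., cycles per input byte)
    std::vector<double> latency = {0.9, 1.4, 0.6, 1.1, 0.8, 1.3};
    const int stages = 4;   // one pipeline stage per core

    double lo = *std::max_element(latency.begin(), latency.end());
    double hi = std::accumulate(latency.begin(), latency.end(), 0.0);
    for (int i = 0; i < 50; ++i) {   // binary search on the stage latency bound
        double mid = 0.5 * (lo + hi);
        if (feasible(latency, stages, mid)) hi = mid; else lo = mid;
    }
    std::printf("balanced stage latency bound: %.3f\n", hi);
    return 0;
}
\end{verbatim}

Minimizing the largest stage latency is what maximizes pipeline throughput, since the slowest stage sets the rate at which segments retire.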
     42
     43
     44The interface between stages is implemented using a ring buffer,
     45where each entry consists of all ten data structures for one segment,
     46as listed in Table \ref{pass_structure}. Each pipeline stage $S$
     47maintains the index ($I_S$) of the buffer entry it is currently
     48processing. Before processing the next buffer frame, the stage checks
     49whether the previous stage is done by spinning on $I_{S-1}$ (stage
     50$S-1$'s index). In commodity multicore chips all threads typically
     51share the last-level cache; if we let faster pipeline stages run far
     52ahead, the data they process increases contention for the shared
     53cache. To prevent this we limit how far the faster pipeline stages
     54can run ahead by controlling the overall size of the ring buffer. A
     55faster stage that runs ahead will eventually fill the ring buffer and
     56implicitly stall.
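
A minimal sketch of this handshake, assuming a C++ implementation with one thread per stage and an atomic per-stage index; the Segment placeholder, stage count, and ring size are illustrative and not Parabix's actual code:

\begin{verbatim}
// Each stage spins on the index published by its predecessor before
// touching the next ring-buffer entry.
#include <atomic>
#include <cstddef>
#include <vector>

constexpr std::size_t RING_SIZE = 16;   // bounds how far stages may run ahead
constexpr int STAGES = 4;               // one thread (core) per stage

struct Segment { /* the ten per-segment data structures */ };

std::vector<Segment> ring(RING_SIZE);
std::atomic<std::size_t> done[STAGES];  // done[s] = segments finished by stage s

// Body of pipeline stage s; 'process' runs the passes assigned to this stage.
void run_stage(int s, std::size_t total_segments, void (*process)(Segment&)) {
    for (std::size_t i = 0; i < total_segments; ++i) {
        if (s > 0) {
            // spin until the previous stage has finished segment i
            while (done[s - 1].load(std::memory_order_acquire) <= i) { }
        } else {
            // the first stage must not lap the last stage, otherwise it
            // would overwrite ring entries that are still being consumed
            while (i >= done[STAGES - 1].load(std::memory_order_acquire)
                        + RING_SIZE) { }
        }
        process(ring[i % RING_SIZE]);
        done[s].store(i + 1, std::memory_order_release);
    }
}
\end{verbatim}

The spin loop on the predecessor's index is where a stage stalls when it runs ahead, and the fixed ring size is what bounds the run-ahead: once the buffer is full, the first stage spins until the final stage frees an entry.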
     57
    4058
    4159\begin{table*}[t]
     
    6785\end{table*}
    6886
    69 Figure \ref{multithread_perf} demonstrates the XML well-formedness checking performance of
    70 the multi-threaded Parabix in comparison with the single-threaded version.
    71 The multi-threaded Parabix is more than two times faster and runs at 2.7 cycles per input byte on the \SB{} machine.
     87
     88Figure~\ref{multithread_perf} demonstrates the performance improvement
     89achieved by pipelined Parabix in comparison with the single-threaded
     90version. The multithreaded version is $\simeq$2$\times$ faster and
     91achieves $\simeq$2.7 cycles per input byte by exploiting the SIMD
     92units of all the cores on \SB{}. This approaches the performance of
     93custom hardware solutions. Parabix demonstrates the potential to
     94enable an entirely new class of applications, text processing, to
     95exploit multicores.
    7296
    7397
    74 Figure \ref{pipeline_power} shows the average power consumed by the multi-threaded Parabix in comparison with the single-threaded version.
    75 By running four threads and using all the cores at the same time, the power consumption of the multi-threaded Parabix is much higher
    76 than the single-threaded version. However, the energy consumption is about the same, because the multi-threaded Parabix needs less processing time.
    77 In fact, as shown in Figure \ref{pipeline_energy}, parsing soap.xml using multi-threaded Parabix consumes less energy than using single-threaded Parabix.
     98Figure \ref{pipeline_power} shows the average power consumed by the
     99multithreaded Parabix. Overall, as expected, the power consumption of
     100the multithreaded Parabix increases with the number of active
     101cores. Note that the increase is not linear, since shared units such
     102as the last-level cache consume active power even when only a single
     103core is active. Perhaps more interestingly, we achieve a
     104corresponding reduction in execution time, which leads to the same
     105energy consumption as the single-threaded version (and in some cases
     106marginally less, as shown for soap.xml).
    78107
    79108\begin{figure}