Aug 23, 2011, 6:38:17 PM (8 years ago)

first parse. Pipeline stage

1 edited


  • docs/HPCA2012/09-pipeline.tex

    r1359 r1362  
    1 \section{Multi-threaded Parabix}
     1\section{Multithreaded Parabix}
    3 The general problem of addressing performance through multicore parallelism
    4 is the increasing energy cost. As discussed in previous sections,
    5 Parabix, which applies SIMD-based techniques can not only achieves better performance but consumes less energy.
    6 Moreover, using mulitiple cores, we can further improve the performance of Parabix while keeping the energy consumption at the same level.
     3The general challenge with boosting performance through multicore
     4parallelism is the power and energy overheads. When we put more cores
     5to work on a chip it results in proportionate increase in
     6power. Unfortunately, due to the runtime overheads associated with
     7thread management and data synchronization it is very hard to obtain
     8corresponding improvements in performance resulting in increased
     9energy costs.  Parabix, which exploits intra-core fine-grain
     10SIMD-based can improve performance and achieve corresponding
     11improvements in energy by improving the overall compute efficiency.
     12However, Parabix is restricted to the resources available within a
     13single core. In this section we have parallelized the Parabix XML
     14parser by hand to study the effects of thread level parallelism in
     15conjunction with Parabix's data parallelism.
    8 A typical approach to parallelizing software, data parallelism,
    9 requires nearly independent data, However, the nature of XML files
    10 makes them hard to partition nicely for data parallelism.  Several
    11 approaches have been used to address this problem.  A preparsing phase
    12 has been proposed to help partition the XML document
    13 \cite{dataparallel}.  The goal of this preparsing is to determine the
    14 tree structure of the XML document so that it can be used to guide the
    15 full parsing in the next phase.  Another data parallel algorithm is
    16 called ParDOM \cite{Shah:2009}.  It first builds partial DOM node tree
    17 structures for each data segments and then link them using preorder
    18 numbers that has been assigned to each start element to determine the
    19 ordering among siblings and a stack to manage the parent-child
    20 relationship between elements.
    22 Data parallelism approaches introduce a lot of overheads to solve the
    23 data dependencies between segments.  Therefore, instead of
    24 partitioning the data into segments and assigning different data
    25 segments to different cores, we propose a pipeline parallelism
    26 strategy that partitions the process into several stages and let each
    27 core work with one single stage.
     18A typical approach to handling data parallelism with multiple threads,
     19requires partitioning data uniformly across the threads. However XML
     20parsing inherently moves through data in sequence, creates data
     21dependencies between the different phases and makes it hard to
     22partition the file. A few attempts have been made to address this
     23problem using a pre-parsing phase to  help determine the tree structure
     24and to partition the XML document~\cite{dataparallel}. There have been
     25other approaches to speculatively partition the data~\cite{Shah:2009}
     26but this requires introducing significant complexity in the overall
     27logic of the program.
    29 The interface between stages is implemented using a circular array,
    30 where each entry consists of all ten data structures for one segment
    31 as listed in Table \ref{pass_structure}.  Each thread keeps an index
    32 of the array ($I_N$), which is compared with the index ($I_{N-1}$)
    33 kept by its previous thread before processing the segment.  If $I_N$
    34 is smaller than $I_{N-1}$, thread N can start processing segment
    35 $I_N$, otherwise the thread keeps reading $I_{N-1}$ until $I_{N-1}$ is
    36 larger than $I_N$.  The time consumed by continuously loading the
    37 value of $I_{N-1}$ and comparing it with $I_N$ will be later referred
    38 as stall time.  When a thread finishes processing the segment, it
    39 increases the index by one.
     30We adopt a contrasting approach to parallelizing the Parabix XML
     31parser.  As described in Section~\ref{} Parabix consists of multiple
     32passes that on every chunk of input data and each of these stages
     33interact in sequence with no data movement from later to earlier
     34passes. This fits well into the mold of pipeline parallelism and we
     35partition the overall parser into several stages and assign a core to
     36each to stage. One of the key challenges is to evaluate which passes
     37need to be grouped into a stage. We analyzed the latency of each of
     38the passes in the single-threaded version of Parabix XML parser
     39(Column 1 in Table~\ref{pass_structure}) and assigned the passes them
     40to stages such that the stage latency of the overall pipeline in
     41balanced resulting in maximal throughput.
     44The interface between stages is implemented using a ring buffer, where
     45each entry consists of all ten data structures for one segment as
     46listed in Table \ref{pass_structure}.  Each pipeline stage S maintains
     47the index of the buffer entry ($I_S$) that is being processed. Before
     48processing the next buffer frame the stage check if the previous stage
     49is done by spinning on $I_{S-1}$ (Stage S-1's buffer entry). In
     50commodity multicore chips typically all threads share the last level
     51cache and if we let faster pipeline stages run ahead the data they
     52process will increase contention to the shared cache. To prevent this
     53we optimize how far the faster pipeline stages can run ahead by
     54controlling the overall size of the ring buffer. The faster stage if
     55it runs ahead will effectively cause the ring buffer to fill up and
     56cause the stage implicitly stall.
    69 Figure \ref{multithread_perf} demonstrates the XML well-formedness checking performance of
    70 the multi-threaded Parabix in comparison with the single-threaded version.
    71 The multi-threaded Parabix is more than two times faster and runs at 2.7 cycles per input byte on the \SB{} machine.
     88Figure~\ref{multithread_perf} demonstrates the performance improvement
     89achieved by pipelined Parabix in comparison with the single-threaded
     90version.  The multithreaded is $\simeq$2$\times$ faster compared to
     91the single threaded version and achieves $\simeq$ 2.7 cycles per input
     92byte by exploiting SIMD units of all \SB{}'s cores.  This performance  approaches the
     93performance of custom hardware solutions. Parabix demonstrates the
     94potential to enable an entire new class of applications, text
     95processing, to exploit multicores 
    74 Figure \ref{pipeline_power} shows the average power consumed by the multi-threaded Parabix in comparison with the single-threaded version.
    75 By running four threads and using all the cores at the same time, the power consumption of the multi-threaded Parabix is much higher
    76 than the single-threaded version. However, the energy consumption is about the same, because the multi-threaded Parabix needs less processing time.
    77 In fact, as shown in Figure \ref{pipeline_energy}, parsing soap.xml using multi-threaded Parabix consumes less energy than using single-threaded Parabix.
     98Figure \ref{pipeline_power} shows the average power consumed by the
     99multithreaded Parabix. Overall, as expected the power consumption of
     100the multithreaded Parabix increases in proportion to the number of
     101active cores. Note that the increase is not linear, since shared units
     102such as last-level-caches consume active power even if a single core
     103is active. Perhaps more interestingly we achieve corresponding
     104reduction in execution time which leads to same energy consumption as
     105the single-thread version (in some cases marginally less energy as shown for
Note: See TracChangeset for help on using the changeset viewer.