Timestamp: Aug 31, 2011, 6:40:45 PM
Author: ksherdy
Message: edits
File edited: docs/HPCA2012/09-pipeline.tex

Changeset: r1407 → r1412
\section{Multithreaded Parabix}
\label{section:multithread}
Even if an application is infinitely parallelizable and thread
synchronization costs are non-existent, all applications are constrained by
the power and energy overheads incurred when utilizing multiple cores:
as more cores are put to work, power increases proportionally.
Unfortunately, due to the runtime overheads associated with
thread management and data synchronization, it is very hard to obtain
corresponding improvements in performance, resulting in increased
energy costs. Parabix-XML can improve performance and reduce energy consumption
by improving overall computation efficiency.
However, up to this point, we have restricted Parabix-XML to a
single core. In this section, we discuss our parallelized version of Parabix-XML
and study the effects of thread-level parallelism in
conjunction with Parabix-XML's data parallelism.
The typical approach to handling data parallelism with multiple threads
involves partitioning data uniformly across the threads. However, XML
parsing is inherently sequential, which makes it difficult to
partition the data. Several attempts have been made to address this
problem using a preparsing phase to help determine the tree structure
and to partition the XML document accordingly~\cite{dataparallel}.
Another approach speculatively partitioned the data~\cite{Shah:2009}, but
this introduced a significant level of complexity into the overall logic of the program.

\begin{table*}[!h]
\begin{tabular}{|@{~}c@{~}|@{~}c@{~}|@{~}c@{~}|c@{~}|@{~}c@{~}|@{~}c@{~}|@{~}c@{~}|@{~}c@{~}|@{~}c@{~}|@{~}c@{~}|@{~}c@{~}|@{~}c@{~}|@{~}c@{~}|}
\hline
        &      & & \multicolumn{10}{|c|}{Data Structure Flow / Dependencies}\\ \hline
        &      &                & data\_buffer & basis\_bits & u8  & lex  & scope & ctCDPI & ref & tag & xml\_names & err\_streams \\ \hline
        & latency (C/B) & size (B) & 128 & 128 & 496 & 448 & 80 & 176 & 112 & 176 & 16 & 112 \\ \hline
% (remaining rows elided in this changeset excerpt)
\end{tabular}
\label{pass_structure}
\end{table*}

In contrast to those methods, we adopted a parallelism strategy that requires
neither speculation nor pre-parsing.
As described in Section~\ref{section:parser}, Parabix-XML consists of multiple
passes that, on every chunk of input data, interact with each other
in sequence with no data movement from later to earlier
passes. This fits well into the mold of pipeline parallelism. We
therefore partitioned Parabix-XML into four stages and assigned a core to
each stage. One of the key challenges was to determine which passes
should be grouped together. We analyzed the latency and data dependencies of each of
the passes in the single-threaded version of Parabix-XML
(Column 1 in Table~\ref{pass_structure}) and assigned the passes
to stages so as to balance stage latencies and thereby maximize overall throughput.
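Because the passes must stay in their original order, this stage assignment amounts to cutting the ordered list of passes into four contiguous groups whose slowest group, which bounds pipeline throughput, is as fast as possible. The sketch below illustrates that selection; the latency numbers, function name, and exhaustive search are illustrative assumptions, not the Parabix implementation:

```cpp
#include <algorithm>
#include <vector>

// Illustrative sketch (not Parabix source): choose the three cut points that
// split an ordered list of pass latencies into four pipeline stages so that
// the slowest stage -- which bounds pipeline throughput -- is as fast as possible.
double min_max_stage_latency(const std::vector<double>& lat) {
    const int n = static_cast<int>(lat.size());
    std::vector<double> pre(n + 1, 0.0);        // pre[i] = total latency of passes [0, i)
    for (int i = 0; i < n; ++i) pre[i + 1] = pre[i] + lat[i];
    double best = pre[n];                       // fallback: one giant stage
    for (int a = 1; a < n - 2; ++a)             // stage 1 = passes [0, a)
        for (int b = a + 1; b < n - 1; ++b)     // stage 2 = passes [a, b)
            for (int c = b + 1; c < n; ++c) {   // stages 3, 4 = [b, c) and [c, n)
                double worst = std::max({pre[a], pre[b] - pre[a],
                                         pre[c] - pre[b], pre[n] - pre[c]});
                best = std::min(best, worst);
            }
    return best;
}
```

For example, with eight hypothetical pass latencies {3, 2, 2, 3, 2, 4, 2, 2}, no contiguous 4-way split can hold every stage at or below 5, so the best achievable bottleneck stage latency is 6.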
The interface between stages is implemented using a ring buffer, where
each entry consists of all ten data structures for one segment, as
listed in Table \ref{pass_structure}. Each pipeline stage $S$ maintains
the index of the buffer entry ($I_S$) that is being processed. Before
processing the next buffer frame, the stage checks whether the previous stage
is done by spinning on $I_{S-1}$ (Stage $S-1$'s buffer entry). In
commodity multicore chips, all threads typically share the last-level
cache. If we let the faster pipeline stages run ahead, the data they
process will increase contention to the shared cache. To prevent this,
we limit how far the faster pipeline stages can run ahead by
controlling the overall size of the ring buffer. Whenever a faster stage
runs ahead, it will effectively cause the ring buffer to fill up and
force that stage to stall.
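To make the synchronization concrete, the following sketch implements this scheme for one pair of adjacent stages: a bounded ring buffer with spin-waiting on the stage indices. The names, the ring size, and the two-stage reduction are illustrative assumptions, not the actual Parabix code:

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <thread>

// Illustrative sketch (not Parabix source) of the bounded ring buffer between
// two adjacent pipeline stages.  RING_SIZE limits how far the faster stage
// may run ahead before it implicitly stalls.
constexpr std::size_t RING_SIZE = 8;

struct Segment { int value; };             // stand-in for the ten per-segment structures

std::array<Segment, RING_SIZE> ring;
std::atomic<std::size_t> produced{0};      // I_{S-1}: entries finished by the earlier stage
std::atomic<std::size_t> consumed{0};      // I_S: entries finished by the later stage

void earlier_stage(int n) {
    for (int i = 0; i < n; ++i) {
        // Stall while the ring is full: running ahead is bounded by RING_SIZE.
        while (produced.load() - consumed.load() >= RING_SIZE) { /* spin */ }
        ring[produced.load() % RING_SIZE].value = i;   // publish one segment's results
        produced.fetch_add(1);
    }
}

void later_stage(int n, long& sum) {
    for (int i = 0; i < n; ++i) {
        // Spin on the earlier stage's index until this entry has been published.
        while (consumed.load() >= produced.load()) { /* spin */ }
        sum += ring[consumed.load() % RING_SIZE].value;
        consumed.fetch_add(1);
    }
}
```

Each index only ever advances and the later stage never passes the earlier one, so a fast stage can run at most RING_SIZE entries ahead before the full-ring check forces it to spin.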
Figure~\ref{multithread_perf} demonstrates the performance improvement
achieved by pipelined Parabix-XML in comparison with the single-threaded
version. The 4-threaded version is $\simeq2\times$ faster than
the single-threaded version and achieves $\simeq2.7$ cycles per input
byte by exploiting the SIMD units of all of \SB{}'s cores. This performance approaches the
1-cycle-per-byte performance of custom hardware solutions~\cite{DaiNiZhu2010}. Parabix demonstrates the
potential to enable an entirely new class of applications, text
processing, to exploit multicores.
     
Figure \ref{pipeline_power} shows the average power consumed by the
multithreaded Parabix-XML. Overall, as expected, the power consumption
increases in proportion to the number of
active cores. Note that the increase is not linear, since shared units
such as last-level caches consume active power even if only one core
is active. Perhaps more interestingly, there was a corresponding
reduction in execution time, which leads to energy consumption similar to that of
the single-threaded version (in some cases marginally less energy, as shown for
soap.xml).

\begin{figure}
\subfigure[Performance (Cycles / kB)]{
\includegraphics[width=0.32\textwidth]{plots/pipeline_performance.pdf}
\label{pipeline_performance}