Ignore:
Timestamp:
Aug 19, 2011, 4:57:57 PM (8 years ago)
Author:
ashriram
Message:

New Intro New title

File:
1 edited

Legend:

Unmodified
Added
Removed
  • docs/HPCA2011/09-pipeline.tex

    r1325 r1326  
    11\section{Multi-threaded Parabix}
    2 The general problem of addressing performance through multicore parallelism
    3 is the increasing energy cost. As discussed in previous sections,
    4 Parabix, which applies SIMD-based techniques can not only achieves better performance but consumes less energy.
    5 Moreover, using mulitiple cores, we can further improve the performance of Parabix while keeping the energy consumption at the same level.
    6 
    7 A typical approach to parallelizing software, data parallelism, requires nearly independent data,
    8 However, the nature of XML files makes them hard to partition nicely for data parallelism.
    9 Several approaches have been used to address this problem.
    10 A preparsing phase has been proposed to help partition the XML document \cite{dataparallel}.
    11 The goal of this preparsing is to determine the tree structure of the XML document
    12 so that it can be used to guide the full parsing in the next phase.
    13 Another data parallel algorithm is called ParDOM \cite{Shah:2009}.
    14 It first builds partial DOM node tree structures for each data segments and then link them
    15 using preorder numbers that has been assigned to each start element to determine the ordering among siblings
    16 and a stack to manage the parent-child relationship between elements.
    17 
    18 Theses data parallelism approaches introduce a lot of overheads to solve the data dependencies between segments.
    19 Therefore, instead of partitioning the data and assigning different data segments to different cores,
    20 we propose a pipeline parallelism strategy that partitions the process into several stages and let each core work with one single stage.
    21 
    22 The interface between stages is implemented using a circular array,
    23 where each entry consists of all ten data structures for one segment as listed in Table \ref{pass_structure}.
    24 Each thread keeps an index of the array ($I_N$),
    25 which is compared with the index ($I_{N-1}$) kept by its previous thread before processing the segment.
    26 If $I_N$ is smaller than $I_{N-1}$, thread N can start processing segment $I_N$,
    27 otherwise the thread keeps reading $I_{N-1}$ until $I_{N-1}$ is larger than $I_N$.
    28 The time consumed by continuously loading the value of $I_{N-1}$ and
    29 comparing it with $I_N$ will be later referred as stall time.
    30 When a thread finishes processing the segment, it increases the index by one.
    312
    323\begin{table*}[t]
     
    345\begin{tabular}{|c|c|c|c|c|c|c|c|c|c|c|c|}
    356\hline
    36        & & \multicolumn{10}{|c|}{Data Structures}\\ \hline
    37        &                & srcbuf & basis\_bits & u8   & lex   & scope & ctCDPI & ref    & tag    & xml\_names & check\_streams\\ \hline
    38 Stage1 &fill\_buffer    & write  &             &      &       &       &        &        &        &            &               \\
    39        &s2p             & read   & write       &      &       &       &        &        &        &            &               \\
    40        &classify\_bytes &        & read        &      & write &       &        &        &        &            &               \\ \hline
    41 Stage2 &validate\_u8    &        & read        & write&       &       &        &        &        &            &               \\
    42        &gen\_scope      &        &             &      & read  & write &        &        &        &            &               \\
    43        &parse\_CtCDPI   &        &             &      & read  & read  & write  &        &        &            & write         \\
    44        &parse\_ref      &        &             &      & read  & read  & read   & write  &        &            &               \\ \hline
    45 Stage3 &parse\_tag      &        &             &      & read  & read  & read   &        & write  &            &               \\
    46        &validate\_name  &        &             & read & read  &       & read   & read   & read   & write      & write         \\
    47        &gen\_check      &        &             & read & read  & read  & read   &        & read   & read       & write         \\ \hline
    48 Stage4 &postprocessing  & read   &             &      & read  &       & read   & read   &        &            & read          \\ \hline
     7Stage Name & \multicolumn{10}{|c|}{Data Structures}\\ \hline
     8                & srcbuf & basis\_bits & u8   & lex   & scope & ctCDPI & ref    & tag    & xml\_names & check\_streams\\ \hline
     9fill\_buffer    & write  &             &      &       &       &        &        &        &            &               \\ \hline
     10s2p             & read   & write       &      &       &       &        &        &        &            &               \\ \hline
     11classify\_bytes &        & read        &      & write &       &        &        &        &            &               \\ \hline
     12validate\_u8    &        & read        & write&       &       &        &        &        &            &               \\ \hline
     13gen\_scope      &        &             &      & read  & write &        &        &        &            &               \\ \hline
     14parse\_CtCDPI   &        &             &      & read  & read  & write  &        &        &            & write         \\ \hline
     15parse\_ref      &        &             &      & read  & read  & read   & write  &        &            &               \\ \hline
     16parse\_tag      &        &             &      & read  & read  & read   &        & write  &            &               \\ \hline
     17validate\_name  &        &             & read & read  &       & read   & read   & read   & write      & write         \\ \hline
     18gen\_check      &        &             & read & read  & read  & read   &        & read   & read       & write         \\ \hline
     19postprocessing  & read   &             &      & read  &       & read   & read   &        &            & read          \\ \hline
    4920\end{tabular}
    5021\end{center}
     
    5324\end{table*}
    5425
    55 Figure \ref{multithread_perf} demonstrates the XML well-formedness checking performance of
    56 the multi-threaded Parabix in comparison with the single-threaded version.
    57 The multi-threaded Parabix is more than two times faster and runs at 2.7 cycles per input byte on the \SB{} machine.
    5826
    5927\begin{figure}
     
    6230\end{center}
    6331\caption{Processing Time (y axis: CPU cycles per byte)}
    64 \label{multithread_perf}
     32\label{perf}
    6533\end{figure}
    66 
    67 Figure \ref{power} shows the average power consumed by the multi-threaded Parabix in comparison with the single-threaded version.
    68 By running four threads and using all the cores at the same time, the power consumption of the multi-threaded Parabix is much higher
    69 than the single-threaded version. However, the energy consumption is about the same, because the multi-threaded Parabix needs less processing time.
    70 In fact, as shown in Figure \ref{energy}, parsing soap.xml using multi-threaded Parabix consumes less energy than using single-threaded Parabix.
    7134
    7235\begin{figure}
    7336\begin{center}
    74 \includegraphics[width=0.5\textwidth]{plots/power.pdf}
     37\includegraphics[width=0.5\textwidth]{plots/perf_energy.pdf}
    7538\end{center}
    76 \caption{Average Power (watts)}
    77 \label{power}
    78 \end{figure}
    79 \begin{figure}
    80 \begin{center}
    81 \includegraphics[width=0.5\textwidth]{plots/energy.pdf}
    82 \end{center}
    83 \caption{Energy Consumption (nJ per byte)}
    84 \label{energy}
     39\caption{Energy vs. Performance (x axis: bytes per cycle, y axis: nJ per byte)}
     40\label{perf_energy}
    8541\end{figure}
    8642
Note: See TracChangeset for help on using the changeset viewer.