Changeset 1320


Timestamp:
Aug 18, 2011, 11:58:21 AM (8 years ago)
Author:
lindanl
Message:

multi-thread section

Location:
docs/HPCA2011
Files:
2 added
3 edited

  • docs/HPCA2011/09-pipeline.tex

    r1302 r1320  
    11\section{Multi-threaded Parabix}
     2The general problem with addressing performance through multicore parallelism
     3is the increasing energy cost. As discussed in previous sections,
     4Parabix, which applies SIMD-based techniques, not only achieves better performance but also consumes less energy.
     5Moreover, using multiple cores, we can further improve the performance of Parabix while keeping the energy consumption at the same level.
     6
     7The typical approach to parallelizing software (data parallelism)
     8requires nearly independent data, which is difficult to obtain
     9when dividing XML data. A simple division determined by the
     10segment size can easily make most of the segments illegal
     11according to the parsing rules, even though the data as a whole is legal.
     12Therefore, instead of dividing the data into segments and
     13assigning different data segments to different cores,
     14we divide the process into several stages and let each core work on a single stage.
     15
     16The interface between stages is implemented using a circular array,
     17where each entry consists of all ten data structures for one segment, as listed in Table \ref{pass_structure}.
     18Each thread keeps an index into the array ($I_N$),
     19which is compared with the index ($I_{N-1}$) kept by the preceding thread before processing a segment.
     20If $I_N$ is smaller than $I_{N-1}$, thread $N$ can start processing segment $I_N$;
     21otherwise, the thread keeps reading $I_{N-1}$ until $I_{N-1}$ is larger than $I_N$.
     22The time spent repeatedly loading the value of $I_{N-1}$ and
     23comparing it with $I_N$ is later referred to as stall time.
     24When a thread finishes processing a segment, it increments its index by one.
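The index protocol described above can be illustrated with a small sketch. This is not the Parabix implementation: the function name \texttt{run\_pipeline}, the ring size, and the placeholder "work" done by each stage are all hypothetical. It also assumes that the first stage waits for the last stage before reusing a slot of the circular array, a detail the text above does not specify.

```python
import threading
import time


def run_pipeline(num_segments, ring_size=4, num_stages=4):
    """Toy pipeline: one thread per stage, synchronized only by per-thread indices."""
    ring = [None] * ring_size      # circular array, one entry per in-flight segment
    index = [0] * num_stages       # index[n] plays the role of I_N for thread n
    results = []

    def stage(n):
        while index[n] < num_segments:
            i = index[n]
            if n == 0:
                # Assumption (not stated in the text): the first stage must not
                # overwrite a slot that the last stage has not finished with.
                while i - index[num_stages - 1] >= ring_size:
                    time.sleep(0)  # yield; this waiting corresponds to stall time
            else:
                # Thread N may process segment I_N only once I_N < I_{N-1}.
                while index[n - 1] <= i:
                    time.sleep(0)  # keep re-reading I_{N-1}: stall time
            slot = i % ring_size
            if n == 0:
                ring[slot] = [i]        # placeholder for filling the segment
            else:
                ring[slot].append(n)    # placeholder for this stage's output
            if n == num_stages - 1:
                results.append(tuple(ring[slot]))
            index[n] = i + 1            # publish progress to the next thread

    threads = [threading.Thread(target=stage, args=(n,)) for n in range(num_stages)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    for entry in run_pipeline(3):
        print(entry)
```

Each index is written by exactly one thread and only read by its neighbour, so the sketch needs no locks. A native implementation would additionally have to consider memory ordering between the data writes and the index update.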
    225
    326\begin{table*}[t]
     
    528\begin{tabular}{|c|c|c|c|c|c|c|c|c|c|c|c|}
    629\hline
    7 Stage Name & \multicolumn{10}{|c|}{Data Structures}\\ \hline
    8                 & srcbuf & basis\_bits & u8   & lex   & scope & ctCDPI & ref    & tag    & xml\_names & check\_streams\\ \hline
    9 fill\_buffer    & write  &             &      &       &       &        &        &        &            &               \\ \hline
    10 s2p             & read   & write       &      &       &       &        &        &        &            &               \\ \hline
    11 classify\_bytes &        & read        &      & write &       &        &        &        &            &               \\ \hline
    12 validate\_u8    &        & read        & write&       &       &        &        &        &            &               \\ \hline
    13 gen\_scope      &        &             &      & read  & write &        &        &        &            &               \\ \hline
    14 parse\_CtCDPI   &        &             &      & read  & read  & write  &        &        &            & write         \\ \hline
    15 parse\_ref      &        &             &      & read  & read  & read   & write  &        &            &               \\ \hline
    16 parse\_tag      &        &             &      & read  & read  & read   &        & write  &            &               \\ \hline
    17 validate\_name  &        &             & read & read  &       & read   & read   & read   & write      & write         \\ \hline
    18 gen\_check      &        &             & read & read  & read  & read   &        & read   & read       & write         \\ \hline
    19 postprocessing  & read   &             &      & read  &       & read   & read   &        &            & read          \\ \hline
     30       & & \multicolumn{10}{|c|}{Data Structures}\\ \hline
     31       &                & srcbuf & basis\_bits & u8   & lex   & scope & ctCDPI & ref    & tag    & xml\_names & check\_streams\\ \hline
     32Stage1 &fill\_buffer    & write  &             &      &       &       &        &        &        &            &               \\
     33       &s2p             & read   & write       &      &       &       &        &        &        &            &               \\
     34       &classify\_bytes &        & read        &      & write &       &        &        &        &            &               \\ \hline
     35Stage2 &validate\_u8    &        & read        & write&       &       &        &        &        &            &               \\
     36       &gen\_scope      &        &             &      & read  & write &        &        &        &            &               \\
     37       &parse\_CtCDPI   &        &             &      & read  & read  & write  &        &        &            & write         \\
     38       &parse\_ref      &        &             &      & read  & read  & read   & write  &        &            &               \\ \hline
     39Stage3 &parse\_tag      &        &             &      & read  & read  & read   &        & write  &            &               \\
     40       &validate\_name  &        &             & read & read  &       & read   & read   & read   & write      & write         \\
     41       &gen\_check      &        &             & read & read  & read  & read   &        & read   & read       & write         \\ \hline
     42Stage4 &postprocessing  & read   &             &      & read  &       & read   & read   &        &            & read          \\ \hline
    2043\end{tabular}
    2144\end{center}
     
    2447\end{table*}
    2548
     49Figure \ref{multithread_perf} compares the XML well-formedness checking performance of
     50the multi-threaded Parabix with that of the single-threaded version.
     51The multi-threaded Parabix is more than twice as fast and runs at 2.7 cycles per input byte on the \SB{} machine.
    2652
    2753\begin{figure}
     
    3056\end{center}
    3157\caption{Processing Time (y axis: CPU cycles per byte)}
    32 \label{perf}
     58\label{multithread_perf}
    3359\end{figure}
     60
     61Figure \ref{power} shows the average power consumed by the multi-threaded Parabix in comparison with the single-threaded version.
     62Because it runs four threads and uses all the cores at the same time, the multi-threaded Parabix draws much more power
     63than the single-threaded version. However, the energy consumption is about the same, because the multi-threaded Parabix needs less processing time.
     64In fact, as shown in Figure \ref{energy}, parsing soap.xml with the multi-threaded Parabix consumes less energy than with the single-threaded Parabix.
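The trade-off between power and processing time can be stated explicitly. As a sketch, let $P$ denote average power and $T$ processing time, with subscripts $mt$ and $st$ for the multi-threaded and single-threaded versions (these symbols are introduced here for illustration and do not appear in the measurements):
\[
E = P \cdot T, \qquad \frac{E_{mt}}{E_{st}} = \frac{P_{mt}}{P_{st}} \cdot \frac{T_{mt}}{T_{st}} .
\]
The energy ratio stays close to one whenever the increase in average power is matched by a comparable reduction in processing time.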
    3465
    3566\begin{figure}
    3667\begin{center}
    37 \includegraphics[width=0.5\textwidth]{plots/perf_energy.pdf}
     68\includegraphics[width=0.5\textwidth]{plots/power.pdf}
    3869\end{center}
    39 \caption{Energy vs. Performance (x axis: bytes per cycle, y axis: nJ per byte)}
    40 \label{perf_energy}
     70\caption{Average Power (watts)}
     71\label{power}
     72\end{figure}
     73\begin{figure}
     74\begin{center}
     75\includegraphics[width=0.5\textwidth]{plots/energy.pdf}
     76\end{center}
     77\caption{Energy Consumption (nJ per byte)}
     78\label{energy}
    4179\end{figure}
    4280
  • docs/HPCA2011/main.aux

    r1302 r1320  
    126126\@writefile{lof}{\contentsline {figure}{\numberline {22}{\ignorespaces Relative Slow Down of Parbix2 and Expat on GT-P1000M vs. Core-i3{} \relax }}{10}}
    127127\newlabel{relative_performance_arm_vs_i3}{{22}{10}}
    128 \@writefile{lof}{\contentsline {figure}{\numberline {23}{\ignorespaces Processing Time (y axis: CPU cycles per byte)\relax }}{10}}
    129 \newlabel{perf}{{23}{10}}
    130128\@writefile{toc}{\contentsline {section}{\numberline {9}Multi-threaded Parabix}{10}}
    131 \@writefile{lof}{\contentsline {figure}{\numberline {24}{\ignorespaces Energy vs. Performance (x axis: bytes per cycle, y axis: nJ per byte)\relax }}{10}}
    132 \newlabel{perf_energy}{{24}{10}}
    133129\bibstyle{abbrv}
    134130\bibdata{reference}
     
    138134\bibcite{TR:XML}{4}
    139135\bibcite{Cameron2009}{5}
     136\@writefile{lot}{\contentsline {table}{\numberline {6}{\ignorespaces Relationship between Each Pass and Data Structures\relax }}{11}}
     137\newlabel{pass_structure}{{6}{11}}
     138\@writefile{lof}{\contentsline {figure}{\numberline {23}{\ignorespaces Processing Time (y axis: CPU cycles per byte)\relax }}{11}}
     139\newlabel{multithread_perf}{{23}{11}}
     140\@writefile{lof}{\contentsline {figure}{\numberline {24}{\ignorespaces Average Power (watts)\relax }}{11}}
     141\newlabel{power}{{24}{11}}
     142\@writefile{toc}{\contentsline {section}{\numberline {10}Conclusion}{11}}
     143\@writefile{lof}{\contentsline {figure}{\numberline {25}{\ignorespaces Energy Consumption (nJ per byte)\relax }}{11}}
     144\newlabel{energy}{{25}{11}}
     145\@writefile{toc}{\contentsline {section}{\numberline {11}References}{11}}
    140146\bibcite{Cameron2008}{6}
    141147\bibcite{Cameron2010}{7}
     
    156162\bibcite{ParaDOM2009}{22}
    157163\bibcite{ZhangPanChiu09}{23}
    158 \@writefile{lot}{\contentsline {table}{\numberline {6}{\ignorespaces Relationship between Each Pass and Data Structures\relax }}{11}}
    159 \newlabel{pass_structure}{{6}{11}}
    160 \@writefile{toc}{\contentsline {section}{\numberline {10}Conclusion}{11}}
    161 \@writefile{toc}{\contentsline {section}{\numberline {11}References}{11}}