\section{Multi-threaded Parabix}
\label{section:multithread}
A general problem with improving performance through multicore parallelism
is the increasing energy cost. As discussed in previous sections, Parabix,
which applies SIMD-based techniques, not only achieves better performance
but also consumes less energy. Moreover, by using multiple cores, we can
further improve the performance of Parabix while keeping the energy
consumption at roughly the same level.

A typical approach to parallelizing software, data parallelism, requires
nearly independent data. However, the nature of XML files makes them hard
to partition nicely for data parallelism. Several approaches have been
used to address this problem. A preparsing phase has been proposed to
help partition the XML document \cite{dataparallel}. The goal of this
preparsing is to determine the tree structure of the XML document so that
it can be used to guide the full parsing in the next phase. Another
data-parallel algorithm is ParDOM \cite{Shah:2009}. It first builds
partial DOM node trees for each data segment and then links them using
preorder numbers, which are assigned to each start element to determine
the ordering among siblings, together with a stack that manages the
parent-child relationships between elements.

Data parallelism approaches introduce considerable overhead to resolve
the data dependencies between segments. Therefore, instead of
partitioning the data into segments and assigning different segments to
different cores, we propose a pipeline parallelism strategy that
partitions the process into several stages and lets each core work on a
single stage.

The interface between stages is implemented using a circular array,
where each entry consists of all ten data structures for one segment, as
listed in Table \ref{pass_structure}. Each thread keeps an index of the
array ($I_N$), which is compared with the index ($I_{N-1}$) kept by the
previous thread before processing a segment. If $I_N$ is smaller than
$I_{N-1}$, thread $N$ can start processing segment $I_N$; otherwise, the
thread keeps reading $I_{N-1}$ until $I_{N-1}$ is larger than $I_N$. The
time consumed by continuously loading the value of $I_{N-1}$ and
comparing it with $I_N$ is later referred to as stall time. When a
thread finishes processing a segment, it increases its index by one.

\begin{table*}[t]
{
\centering
\footnotesize
\begin{center}
\begin{tabular}{|c|@{~}c@{~}|c|@{~}c@{~}|c@{~}|@{~}c@{~}|c|@{~}c@{~}|c|@{~}c@{~}|c|@{~}c@{~}|}
\hline
       & & \multicolumn{10}{|c|}{Data Structures}\\ \hline
       &                & data\_buffer& basis\_bits & u8   & lex   & scope & ctCDPI & ref    & tag    & xml\_names & err\_streams\\ \hline
Stage1 &read\_data      & write       &             &      &       &       &        &        &        &            &               \\
       &transposition   & read        & write       &      &       &       &        &        &        &            &               \\
       &classification  &             & read        &      & write &       &        &        &        &            &               \\ \hline
Stage2 &validate\_u8    &             & read        & write&       &       &        &        &        &            &               \\
       &gen\_scope      &             &             &      & read  & write &        &        &        &            &               \\
       &parse\_CtCDPI   &             &             &      & read  & read  & write  &        &        &            & write         \\
       &parse\_ref      &             &             &      & read  & read  & read   & write  &        &            &               \\ \hline
Stage3 &parse\_tag      &             &             &      & read  & read  & read   &        & write  &            &               \\
       &validate\_name  &             &             & read & read  &       & read   & read   & read   & write      & write         \\
       &gen\_check      &             &             & read & read  & read  & read   &        & read   & read       & write         \\ \hline
Stage4 &postprocessing  & read        &             &      & read  &       & read   & read   &        &            & read          \\ \hline
\end{tabular}
\end{center}
\caption{Relationship between Each Pass and Data Structures}
\label{pass_structure}
}
\end{table*}
Figure \ref{multithread_perf} demonstrates the XML well-formedness
checking performance of the multi-threaded Parabix in comparison with
the single-threaded version. The multi-threaded Parabix is more than
twice as fast, running at 2.7 cycles per input byte on the \SB{} machine.

Figure \ref{pipeline_power} shows the average power consumed by the
multi-threaded Parabix in comparison with the single-threaded version.
By running four threads and using all the cores at the same time, the
multi-threaded Parabix draws considerably more power than the
single-threaded version. However, the overall energy consumption is
about the same, because the multi-threaded version needs less processing
time. In fact, as shown in Figure \ref{pipeline_energy}, parsing
soap.xml with the multi-threaded Parabix consumes less energy than with
the single-threaded version.

\begin{figure}
\subfigure[Performance (Cycles / kByte)]{
\includegraphics[width=0.32\textwidth]{plots/pipeline_performance.pdf}
\label{pipeline_performance}
}
\subfigure[Avg. Power Consumption]{
\includegraphics[width=0.32\textwidth]{plots/pipeline_power.pdf}
\label{pipeline_power}
}
\subfigure[Avg. Energy Consumption (nJ / Byte)]{
\includegraphics[width=0.32\textwidth]{plots/pipeline_energy.pdf}
\label{pipeline_energy}
}
\caption{Multithreaded Parabix}
\label{multithread_perf}
\end{figure}