\section{Multi-threaded Parabix}
\label{section:multithread}
A general problem with addressing performance through multicore parallelism
is its increased energy cost. As discussed in previous sections,
Parabix, which applies SIMD-based techniques, not only achieves better performance but also consumes less energy.
Moreover, by using multiple cores, we can further improve the performance of Parabix while keeping the energy consumption at the same level.

A typical approach to parallelizing software, data parallelism, requires nearly independent data.
However, the nature of XML files makes them hard to partition nicely for data parallelism.
Several approaches have been used to address this problem.
A preparsing phase has been proposed to help partition the XML document \cite{dataparallel}.
The goal of this preparsing is to determine the tree structure of the XML document
so that it can be used to guide the full parsing in the next phase.
Another data parallel algorithm is called ParDOM \cite{Shah:2009}.
It first builds partial DOM node tree structures for each data segment and then links them,
using preorder numbers that have been assigned to each start element to determine the ordering among siblings
and a stack to manage the parent-child relationships between elements.

Data parallelism approaches introduce considerable overhead to resolve the data dependencies between segments.
Therefore, instead of partitioning the data into segments and assigning different data segments to different cores,
we propose a pipeline parallelism strategy that partitions the process into several stages and lets each core work on a single stage.

The interface between stages is implemented using a circular array,
where each entry consists of all ten data structures for one segment, as listed in Table \ref{pass_structure}.
Each thread keeps an index into the array ($I_N$),
which is compared with the index ($I_{N-1}$) kept by its previous thread before processing the segment.
If $I_N$ is smaller than $I_{N-1}$, thread $N$ can start processing segment $I_N$;
otherwise, the thread keeps reading $I_{N-1}$ until $I_{N-1}$ is larger than $I_N$.
The time consumed by continuously loading the value of $I_{N-1}$ and
comparing it with $I_N$ is later referred to as stall time.
When a thread finishes processing a segment, it increments its index by one.
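
For concreteness, the fragment below sketches this handshake in C++.
It is an illustrative reconstruction under stated assumptions, not the actual Parabix source:
the \texttt{Segment} type, the \texttt{process\_stage} placeholder, and the array capacity are hypothetical,
and the check that keeps the first stage from overwriting entries the last stage has not yet consumed is omitted for brevity.

{\footnotesize
\begin{verbatim}
#include <atomic>
#include <cstddef>

const std::size_t NUM_STAGES = 4;
const std::size_t SEGMENTS   = 16;  // capacity (assumed)

// One entry: the ten per-segment data structures.
struct Segment { /* ... */ };

Segment buffer[SEGMENTS];           // the circular array
std::atomic<std::size_t> index_of[NUM_STAGES];  // I_N

// Placeholder: runs the passes of stage N on segment s.
void process_stage(std::size_t N, Segment &s);

void stage_thread(std::size_t N, std::size_t segments) {
  for (std::size_t I_N = 0; I_N < segments; ++I_N) {
    // Stall: spin while I_N >= I_{N-1}, i.e. until the
    // previous stage has finished this segment
    // (the first stage, N == 0, has no predecessor).
    while (N > 0 && I_N >= index_of[N - 1].load())
      ;                             // stall time spent here
    process_stage(N, buffer[I_N % SEGMENTS]);
    index_of[N].store(I_N + 1);     // segment I_N is done
  }
}
\end{verbatim}
}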

\begin{table*}[t]
{
\centering
\footnotesize
\begin{center}
\begin{tabular}{|c|@{~}c@{~}|c|@{~}c@{~}|c@{~}|@{~}c@{~}|c|@{~}c@{~}|c|@{~}c@{~}|c|@{~}c@{~}|}
\hline
       & & \multicolumn{10}{|c|}{Data Structures}\\ \hline
       &                & data\_buffer& basis\_bits & u8   & lex   & scope & ctCDPI & ref    & tag    & xml\_names & err\_streams\\ \hline
Stage1 &read\_data      & write       &             &      &       &       &        &        &        &            &               \\
       &transposition   & read        & write       &      &       &       &        &        &        &            &               \\
       &classification  &             & read        &      & write &       &        &        &        &            &               \\ \hline
Stage2 &validate\_u8    &             & read        & write&       &       &        &        &        &            &               \\
       &gen\_scope      &             &             &      & read  & write &        &        &        &            &               \\
       &parse\_CtCDPI   &             &             &      & read  & read  & write  &        &        &            & write         \\
       &parse\_ref      &             &             &      & read  & read  & read   & write  &        &            &               \\ \hline
Stage3 &parse\_tag      &             &             &      & read  & read  & read   &        & write  &            &               \\
       &validate\_name  &             &             & read & read  &       & read   & read   & read   & write      & write         \\
       &gen\_check      &             &             & read & read  & read  & read   &        & read   & read       & write         \\ \hline
Stage4 &postprocessing  & read        &             &      & read  &       & read   & read   &        &            & read          \\ \hline
\end{tabular}
\end{center}
\caption{Relationship between Each Pass and Data Structures}
\label{pass_structure}
}
\end{table*}
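
To make the stage boundaries of Table \ref{pass_structure} concrete, the fragment below gives one
possible definition of the \texttt{process\_stage} placeholder from the sketch above.
The pass functions are empty stand-ins named after the passes in the table,
and the \texttt{Segment} type is repeated so that the fragment stands alone;
this is an illustration of the grouping, not the actual Parabix code.

{\footnotesize
\begin{verbatim}
#include <cstddef>

struct Segment { /* per-segment data structures */ };

// Empty stand-ins for the corresponding Parabix passes.
void read_data(Segment &)      {}
void transposition(Segment &)  {}
void classification(Segment &) {}
void validate_u8(Segment &)    {}
void gen_scope(Segment &)      {}
void parse_CtCDPI(Segment &)   {}
void parse_ref(Segment &)      {}
void parse_tag(Segment &)      {}
void validate_name(Segment &)  {}
void gen_check(Segment &)      {}
void postprocessing(Segment &) {}

void process_stage(std::size_t N, Segment &s) {
  switch (N) {
  case 0: read_data(s); transposition(s);
          classification(s);               break;
  case 1: validate_u8(s); gen_scope(s);
          parse_CtCDPI(s); parse_ref(s);   break;
  case 2: parse_tag(s); validate_name(s);
          gen_check(s);                    break;
  case 3: postprocessing(s);               break;
  }
}
\end{verbatim}
}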

Figure \ref{multithread_perf} demonstrates the XML well-formedness checking performance of
the multi-threaded Parabix in comparison with the single-threaded version.
The multi-threaded Parabix is more than two times faster and runs at 2.7 cycles per input byte on the \SB{} machine.

Figure \ref{pipeline_power} shows the average power consumed by the multi-threaded Parabix in comparison with the single-threaded version.
By running four threads and using all the cores at the same time, the multi-threaded Parabix draws considerably more power
than the single-threaded version. However, because energy is the product of average power and execution time, and the multi-threaded Parabix needs less processing time, the energy consumption is about the same.
In fact, as shown in Figure \ref{pipeline_energy}, parsing soap.xml using the multi-threaded Parabix consumes less energy than using the single-threaded Parabix.

\begin{figure}
\subfigure[Performance (Cycles / kByte)]{
\includegraphics[width=0.32\textwidth]{plots/pipeline_performance.pdf}
\label{pipeline_performance}
}
\subfigure[Avg. Power Consumption]{
\includegraphics[width=0.32\textwidth]{plots/pipeline_power.pdf}
\label{pipeline_power}
}
\subfigure[Avg. Energy Consumption (nJ / Byte)]{
\includegraphics[width=0.32\textwidth]{plots/pipeline_energy.pdf}
\label{pipeline_energy}
}
\caption{Multi-threaded Parabix}
\label{multithread_perf}
\end{figure}