\section{Multithreaded Parabix}
\label{section:multithread}
Even if an application is infinitely parallelizable and thread
synchronization costs are non-existent, all applications are constrained by
the power and energy overheads incurred when utilizing multiple cores:
as more cores are put to work, power consumption rises proportionally.
Unfortunately, due to the runtime overheads associated with
thread management and data synchronization, it is very hard to obtain
corresponding improvements in performance, which results in increased
energy costs. Parabix-XML can improve performance and reduce energy consumption
by improving overall computation efficiency.
However, up to this point we have restricted Parabix-XML to a
single core. In this section, we discuss our parallelized version of Parabix-XML
to study the effects of thread-level parallelism in
conjunction with Parabix-XML's data parallelism.

The typical approach to handling data parallelism with multiple threads
involves partitioning data uniformly across the threads. However, XML
parsing is inherently sequential, which makes it difficult to
partition the data. Several attempts have been made to address this
problem. One approach uses a preparsing phase to determine the tree structure
and partition the XML document accordingly~\cite{dataparallel}.
Another speculatively partitions the data~\cite{Shah:2009}, but
this introduces a significant level of complexity into the overall logic of the program.

\begin{table}[htbp]
\centering
\footnotesize
\begin{tabular}{|@{~}c@{~}|@{~}c@{~}|@{~}c@{~}|}
\hline
Stage   & Functions                                            & Latency (C/B)    \\ \hline
Stage 1 & read\_data, transposition, classification            & 1.97             \\ \hline
Stage 2 & validate\_u8, gen\_scope, parse\_CtCDPI, parse\_ref  & 1.22             \\ \hline
Stage 3 & parse\_tag, validate\_name, gen\_check               & 2.03             \\ \hline
Stage 4 & postprocessing                                       & 1.32             \\ \hline
\end{tabular}
\caption{Stage division of the Parabix-XML passes, with per-stage latency in CPU cycles per input byte (C/B).}
\label{pass_structure}
\end{table}

In contrast to those methods, we adopted a parallelism strategy that requires
neither speculation nor preparsing.
As described in Section~\ref{section:parser}, Parabix-XML consists of multiple
passes that, on every chunk of input data, execute in sequence,
with no data flowing from later to earlier
passes. This fits well into the mold of pipeline parallelism. We
partitioned Parabix-XML into four stages and assigned a core to
each stage. One of the key challenges was to determine which passes
should be grouped together. We analyzed the latency and data dependencies of each pass
in the single-threaded version of Parabix (Column 3 in Table~\ref{pass_structure})
and assigned the passes to stages so as to maximize throughput.
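
Because all stages operate concurrently on successive segments, the
steady-state throughput of such a pipeline is bounded by its slowest
stage. From Table~\ref{pass_structure},
\[
T_{\text{pipeline}} \;\ge\; \max_{S}\, L_{S} \;=\; 2.03 \text{ C/B (Stage 3)},
\]
so roughly two cycles per byte is the best this particular grouping can
achieve; the gap to the measured $\simeq 2.7$~C/B reported below is
plausibly due to synchronization and shared-cache contention.
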
The interface between stages is implemented as a ring buffer, where
each entry consists of all ten data structures for one segment.
Each pipeline stage $S$ maintains
the index $I_S$ of the buffer entry that it is currently processing. Before
processing the next entry, a stage checks whether the previous stage
is done by spinning on $I_{S-1}$ (Stage $S-1$'s index). In
commodity multicore chips, all threads typically share the last-level
cache. If we let the faster pipeline stages run ahead, the data they
process will increase contention for the shared cache. To prevent this,
we limit how far the faster stages can run ahead by
bounding the overall size of the ring buffer: whenever a faster stage
runs ahead, it eventually fills the ring buffer and is
forced to stall. Experiments show that a six-entry ring buffer
gives the best performance.
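
To make the handoff concrete, the following is a minimal sketch of this
ring-buffer scheme using C11 atomics and POSIX threads. It is not the
actual Parabix source: the identifiers (\texttt{segment\_t},
\texttt{stage\_run}, \texttt{done}) are illustrative, and a monotonically
increasing per-stage counter stands in for the index $I_S$. The fullness
check in stage~0 is what stalls fast stages once the six-entry ring fills.

{\footnotesize
\begin{verbatim}
/* Build with: cc -std=c11 -pthread sketch.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

enum { RING_ENTRIES = 6,    /* six entries performed best */
       NUM_STAGES   = 4,
       NUM_SEGMENTS = 32 }; /* segments to push through */

/* One entry stands in for the ten per-segment structures. */
typedef struct { long payload; } segment_t;
static segment_t ring[RING_ENTRIES];

/* done[S] = number of segments stage S has fully processed. */
static atomic_long done[NUM_STAGES];

/* Placeholder for the real per-stage passes. */
static void process(int stage, long seq, segment_t *seg)
{
    seg->payload = (stage == 0) ? seq : seg->payload + 1;
}

static void *stage_run(void *arg)
{
    int S = (int)(long)arg;
    for (long seq = 0; seq < NUM_SEGMENTS; seq++) {
        if (S == 0) {
            /* Stage 0 stalls while the ring is full, i.e. until
             * the last stage has retired the entry it wants to
             * overwrite; this bounds how far fast stages run ahead. */
            while (seq - atomic_load(&done[NUM_STAGES - 1])
                   >= RING_ENTRIES)
                ; /* spin */
        } else {
            /* Spin until stage S-1 has finished this segment. */
            while (atomic_load(&done[S - 1]) <= seq)
                ; /* spin */
        }
        process(S, seq, &ring[seq % RING_ENTRIES]);
        atomic_store(&done[S], seq + 1);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NUM_STAGES];
    for (long s = 0; s < NUM_STAGES; s++)
        pthread_create(&t[s], NULL, stage_run, (void *)s);
    for (int s = 0; s < NUM_STAGES; s++)
        pthread_join(t[s], NULL);
    printf("last payload: %ld\n",
           ring[(NUM_SEGMENTS - 1) % RING_ENTRIES].payload);
    return 0;
}
\end{verbatim}
}
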
Figure~\ref{multithread_perf} shows the performance improvement
achieved by pipelined Parabix-XML over the
single-threaded version.  The 4-threaded version is $\simeq2\times$
faster than the single-threaded version and achieves
$\simeq2.7$ cycles per input byte by exploiting the SIMD units of all
\SB{}'s cores.  This performance approaches the 1 cycle per byte
performance of custom hardware solutions~\cite{DaiNiZhu2010}. Parabix
demonstrates the potential to enable an entirely new class of
applications, text processing, to exploit multicores.

Figure~\ref{multithread_perf} also shows the average power consumed by the
multithreaded Parabix. Overall, as expected, power consumption
increases with the number of
active cores. Note that the increase is not linear, since shared units
such as the last-level cache consume active power even when only one core
is active. Perhaps more interestingly, the reduction in execution time
leads to energy consumption similar to that of the single-threaded execution
(and in some cases marginally lower, e.g., soap.xml).
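
To first order, this energy behavior follows from the relation
$E = P \cdot t$: the 4-threaded version draws roughly twice the
single-core power but finishes in roughly half the time, so
\[
E_{\text{4-thread}} \;\approx\; (2P)\cdot\frac{t}{2} \;=\; P \cdot t \;=\; E_{\text{1-thread}}.
\]
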
\begin{figure}[htbp]
\begin{center}
\includegraphics[width=0.50\textwidth]{plots/pipeline.pdf}
\end{center}
\caption{Average Statistics of Multithreaded Parabix}
\label{multithread_perf}
\end{figure}