\section{Multithreaded Parabix}
\label{section:multithread}
The general challenge with boosting performance through multicore
parallelism is the attendant power and energy overhead. Putting more
cores to work on a chip results in a proportionate increase in power.
Unfortunately, the runtime overheads of thread management and data
synchronization make it very hard to obtain corresponding improvements
in performance, resulting in increased energy costs. Parabix, which
exploits intra-core fine-grained SIMD parallelism, can improve
performance and achieve corresponding improvements in energy by
improving overall compute efficiency. However, Parabix is restricted
to the resources available within a single core. In this section we
parallelize the Parabix XML parser by hand to study the effects of
thread-level parallelism in conjunction with Parabix's data
parallelism.

A typical approach to exploiting data parallelism with multiple
threads is to partition the data uniformly across the threads. However,
XML parsing inherently moves through the data in sequence, which
creates dependencies between the different phases and makes it hard to
partition the file. A few attempts have been made to address this
problem using a pre-parsing phase that helps determine the tree
structure and partition the XML document~\cite{dataparallel}. Other
approaches speculatively partition the data~\cite{Shah:2009}, but this
introduces significant complexity into the overall logic of the
program.

We adopt a contrasting approach to parallelizing the Parabix XML
parser. As described in Section~\ref{section:parser}, Parabix consists
of multiple passes that operate on every chunk of input data, and
these passes interact in sequence with no data flowing from later to
earlier passes. This fits well into the mold of pipeline parallelism,
so we partition the overall parser into several stages and assign a
core to each stage. One of the key challenges is to decide which
passes should be grouped into each stage. We analyzed the latency of
each pass in the single-threaded version of the Parabix XML parser
(Column 1 in Table~\ref{pass_structure}) and assigned the passes to
stages such that the stage latencies of the overall pipeline are
balanced, resulting in maximal throughput.

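As a back-of-the-envelope check (assuming perfect overlap between
stages and ignoring synchronization costs), the steady-state
throughput of a pipeline is bounded below by its slowest stage. Using
the per-stage latencies of Table~\ref{pass_structure}:
\[
T_{\mathrm{pipeline}} \;\geq\; \max(1.97,\ 1.22,\ 2.03,\ 1.32) \;=\; 2.03 \ \mathrm{cycles/byte},
\]
whereas the single-threaded version pays the sum of all stage
latencies, $1.97 + 1.22 + 2.03 + 1.32 = 6.54$ cycles/byte.
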
The interface between stages is implemented using a ring buffer, where
each entry consists of all ten data structures for one segment, as
listed in Table~\ref{pass_structure}. Each pipeline stage $S$
maintains the index of the buffer entry ($I_S$) that it is currently
processing. Before processing the next buffer frame, the stage checks
whether the previous stage is done by spinning on $I_{S-1}$ (stage
$S-1$'s buffer index). In commodity multicore chips all threads
typically share the last-level cache, and if we let faster pipeline
stages run ahead, the data they process will increase contention in
the shared cache. To prevent this, we limit how far the faster
pipeline stages can run ahead by controlling the overall size of the
ring buffer. A faster stage that runs ahead will eventually fill the
ring buffer, causing the stage to stall implicitly.

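The handoff between two adjacent stages can be sketched as follows.
This is a minimal single-producer/single-consumer sketch using C11
atomics and pthreads; the names, the two-stage structure, and the
integer "segments" are illustrative, not Parabix's actual
implementation:

```c
/* Minimal sketch of the ring-buffer handoff between two pipeline
 * stages. Assumes C11 atomics and pthreads; all names here are
 * illustrative, not Parabix's actual identifiers. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define RING_SIZE 4   /* small ring bounds how far stage 0 runs ahead */
#define SEGMENTS 16   /* total number of input segments */

static int ring[RING_SIZE];   /* one "segment" of work per entry */
static atomic_long done[2];   /* done[S]: entries stage S has finished */
static long result;           /* checked by the driver below */

static void *stage0(void *arg) {             /* producer stage */
    (void)arg;
    for (long i = 0; i < SEGMENTS; i++) {
        /* ring full? spin until the next stage drains an entry */
        while (i - atomic_load(&done[1]) >= RING_SIZE)
            ;                                /* implicit stall */
        ring[i % RING_SIZE] = (int)(i * 2);  /* "process" the segment */
        atomic_store(&done[0], i + 1);       /* publish progress */
    }
    return NULL;
}

static void *stage1(void *arg) {             /* consumer stage */
    long sum = 0;
    (void)arg;
    for (long i = 0; i < SEGMENTS; i++) {
        /* spin on the previous stage's index until entry i exists */
        while (atomic_load(&done[0]) <= i)
            ;
        sum += ring[i % RING_SIZE];
        atomic_store(&done[1], i + 1);
    }
    result = sum;
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, stage0, NULL);
    pthread_create(&t1, NULL, stage1, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("sum = %ld\n", result);   /* 2*(0+1+...+15) = 240 */
    return 0;
}
```

In the real parser each ring entry holds the ten per-segment data
structures of Table~\ref{pass_structure} and there are four stages
rather than two, but the spin on the previous stage's index and the
implicit stall when the ring fills work the same way.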
\begin{table*}[t]
{
\centering
\footnotesize
\begin{center}
\begin{tabular}{|@{~}c@{~}|@{~}c@{~}|@{~}c@{~}|c@{~}|@{~}c@{~}|@{~}c@{~}|@{~}c@{~}|@{~}c@{~}|@{~}c@{~}|@{~}c@{~}|@{~}c@{~}|@{~}c@{~}|@{~}c@{~}|}
\hline
        &      & & \multicolumn{10}{|c|}{Data Structures}\\ \hline
        &      &                & data\_buffer& basis\_bits & u8   & lex   & scope & ctCDPI & ref    & tag    & xml\_names & err\_streams\\ \hline
        & latency (C/B)    & size (B)       & 128         & 128         & 496  & 448   & 80    & 176    & 112    & 176    & 16         & 112           \\ \hline
Stage1  & 1.97 &read\_data      & write       &             &      &       &       &        &        &        &            &               \\ 
        &      &transposition   & read        & write       &      &       &       &        &        &        &            &               \\ 
        &      &classification  &             & read        &      & write &       &        &        &        &            &               \\ \hline
Stage2  & 1.22 &validate\_u8    &             & read        & write&       &       &        &        &        &            &               \\ 
        &      &gen\_scope      &             &             &      & read  & write &        &        &        &            &               \\ 
        &      &parse\_CtCDPI   &             &             &      & read  & read  & write  &        &        &            & write         \\ 
        &      &parse\_ref      &             &             &      & read  & read  & read   & write  &        &            &               \\ \hline
Stage3  & 2.03 &parse\_tag      &             &             &      & read  & read  & read   &        & write  &            &               \\ 
        &      &validate\_name  &             &             & read & read  &       & read   & read   & read   & write      & write         \\ 
        &      &gen\_check      &             &             & read & read  & read  & read   &        & read   & read       & write         \\ \hline
Stage4  & 1.32 &postprocessing  & read        &             &      & read  &       & read   & read   &        &            & read          \\ \hline
\end{tabular}
\end{center}
\caption{Relationship between each pass and the data structures it reads and writes; stage latencies are given in CPU cycles per input byte (C/B)} 
\label{pass_structure}
} 
\end{table*}

Figure~\ref{multithread_perf} demonstrates the performance improvement
achieved by the pipelined Parabix in comparison with the
single-threaded version. The multithreaded version is
$\simeq$2$\times$ faster than the single-threaded version and achieves
$\simeq$2.7 cycles per input byte by exploiting the SIMD units of all
of \SB{}'s cores. This approaches the performance of custom hardware
solutions. Parabix demonstrates the potential to enable an entirely
new class of applications, text processing, to exploit multicores.

Figure \ref{pipeline_power} shows the average power consumed by the
multithreaded Parabix. Overall, as expected, the power consumption of
the multithreaded Parabix increases with the number of active cores.
Note that the increase is not linear, since shared units such as the
last-level cache consume active power even when only a single core is
active. Perhaps more interestingly, we achieve a corresponding
reduction in execution time, which leads to the same energy
consumption as the single-threaded version (in some cases marginally
less energy, as shown for soap.xml).

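This follows from the simple model $E = P \cdot t$: if the pipelined
version draws roughly $k$ times the power of the single-threaded
version but completes in $1/k$ of the time, then
\[
E_{\mathrm{pipe}} \;=\; P_{\mathrm{pipe}} \cdot t_{\mathrm{pipe}}
\;\approx\; (k P) \cdot \frac{t}{k} \;=\; P \cdot t \;=\; E_{\mathrm{single}},
\]
so any net energy saving must come from the speedup slightly exceeding
the power increase, for example because fixed power such as that of
the shared last-level cache is amortized across the active cores.
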
\begin{figure}
\subfigure[Performance (Cycles / kByte)]{
\includegraphics[width=0.32\textwidth]{plots/pipeline_performance.pdf}
\label{pipeline_performance}
}
\subfigure[Avg. Power Consumption]{
\includegraphics[width=0.32\textwidth]{plots/pipeline_power.pdf}
\label{pipeline_power}
}
\subfigure[Avg. Energy Consumption (nJ / Byte)]{
  \includegraphics[width=0.32\textwidth]{plots/pipeline_energy.pdf}
\label{pipeline_energy}
}
\caption{Multithreaded Parabix}
\label{multithread_perf}
\end{figure}
