\section{Multithreaded Parabix}
\label{section:multithread}
The general challenge in boosting performance through multicore
parallelism is the associated power and energy overhead. Putting more
cores to work on a chip results in a proportionate increase in
power. Unfortunately, the runtime overheads of thread management and
data synchronization make it hard to obtain corresponding improvements
in performance, resulting in increased energy costs. Parabix, which
exploits fine-grained SIMD parallelism within a core, can improve
performance and achieve corresponding improvements in energy by
improving overall compute efficiency. However, Parabix is restricted
to the resources available within a single core. In this section we
parallelize the Parabix XML parser by hand to study the effects of
thread-level parallelism in conjunction with Parabix's data
parallelism.

A typical approach to exploiting data parallelism with multiple
threads requires partitioning the data uniformly across the
threads. However, XML parsing inherently moves through the data in
sequence, which creates data dependencies between the different phases
and makes it hard to partition the file. A few attempts have been made
to address this problem using a pre-parsing phase to help determine
the tree structure and partition the XML
document~\cite{dataparallel}. Other approaches speculatively partition
the data~\cite{Shah:2009}, but this introduces significant complexity
into the overall logic of the program.

\begin{table*}[!h]
{
\centering
\footnotesize
\begin{center}
\begin{tabular}{|@{~}c@{~}|@{~}c@{~}|@{~}c@{~}|c@{~}|@{~}c@{~}|@{~}c@{~}|@{~}c@{~}|@{~}c@{~}|@{~}c@{~}|@{~}c@{~}|@{~}c@{~}|@{~}c@{~}|@{~}c@{~}|}
\hline
        &      & & \multicolumn{10}{|c|}{Data Structures}\\ \hline
        &      &                & data\_buffer& basis\_bits & u8   & lex   & scope & ctCDPI & ref    & tag    & xml\_names & err\_streams\\ \hline
        & Latency (C/B)     & Size (B)       & 128         & 128         & 496  & 448   & 80    & 176    & 112    & 176    & 16         & 112           \\ \hline
Stage1  & 1.97 &read\_data      & write       &             &      &       &       &        &        &        &            &               \\
        &      &transposition   & read        & write       &      &       &       &        &        &        &            &               \\
        &      &classification  &             & read        &      & write &       &        &        &        &            &               \\ \hline
Stage2  & 1.22 &validate\_u8    &             & read        & write&       &       &        &        &        &            &               \\
        &      &gen\_scope      &             &             &      & read  & write &        &        &        &            &               \\
        &      &parse\_CtCDPI   &             &             &      & read  & read  & write  &        &        &            & write         \\
        &      &parse\_ref      &             &             &      & read  & read  & read   & write  &        &            &               \\ \hline
Stage3  & 2.03 &parse\_tag      &             &             &      & read  & read  & read   &        & write  &            &               \\
        &      &validate\_name  &             &             & read & read  &       & read   & read   & read   & write      & write         \\
        &      &gen\_check      &             &             & read & read  & read  & read   &        & read   & read       & write         \\ \hline
Stage4  & 1.32 &postprocessing  & read        &             &      & read  &       & read   & read   &        &            & read          \\ \hline
\end{tabular}
\end{center}
\caption{Relationship between Each Pass and Data Structures}
\label{pass_structure}
}
\end{table*}

We adopt a contrasting approach to parallelizing the Parabix XML
parser. As described in Section~\ref{section:parser}, Parabix consists
of multiple passes that operate on every chunk of input data, and
these passes interact in sequence with no data movement from later to
earlier passes. This fits well into the mold of pipeline parallelism,
so we partition the overall parser into several stages and assign a
core to each stage. One of the key challenges is deciding which passes
to group into each stage. We analyzed the latency of each pass in the
single-threaded version of the Parabix XML parser (Column 1 in
Table~\ref{pass_structure}) and assigned the passes to stages such
that the stage latencies of the overall pipeline are balanced,
resulting in maximal throughput.

The interface between stages is implemented using a ring buffer, where
each entry consists of all ten data structures for one segment, as
listed in Table~\ref{pass_structure}. Each pipeline stage $S$
maintains the index of the buffer entry ($I_S$) that it is currently
processing. Before processing the next buffer entry, a stage checks
whether the previous stage is done by spinning on $I_{S-1}$ (stage
$S-1$'s entry index). In commodity multicore chips all threads
typically share the last-level cache, and if we let faster pipeline
stages run ahead, the data they process will increase contention for
the shared cache. To prevent this, we bound how far the faster
pipeline stages can run ahead by controlling the overall size of the
ring buffer. A faster stage that runs ahead will eventually fill the
ring buffer and implicitly stall.

Figure~\ref{multithread_perf} demonstrates the performance improvement
achieved by the pipelined Parabix in comparison with the
single-threaded version. The multithreaded version is
$\simeq$2$\times$ faster than the single-threaded version and achieves
$\simeq$2.7 cycles per input byte by exploiting the SIMD units of all
of \SB{}'s cores. This approaches the performance of custom hardware
solutions. Parabix demonstrates the potential to enable an entirely
new class of applications, text processing, to exploit multicores.

Figure~\ref{pipeline_power} shows the average power consumed by the
multithreaded Parabix. Overall, as expected, the power consumption of
the multithreaded Parabix increases with the number of active
cores. Note that the increase is not linear, since shared units such
as the last-level cache consume active power even when only a single
core is active. Perhaps more interestingly, we achieve a corresponding
reduction in execution time, which leads to the same energy
consumption as the single-threaded version (in some cases marginally
less energy, as shown for soap.xml).

\begin{figure}
\subfigure[Performance (Cycles / kByte)]{
\includegraphics[width=0.32\textwidth]{plots/pipeline_performance.pdf}
\label{pipeline_performance}
}
\subfigure[Avg. Power Consumption]{
\includegraphics[width=0.32\textwidth]{plots/pipeline_power.pdf}
\label{pipeline_power}
}
\subfigure[Avg. Energy Consumption (nJ / Byte)]{
\includegraphics[width=0.32\textwidth]{plots/pipeline_energy.pdf}
\label{pipeline_energy}
}
\caption{Multithreaded Parabix}
\label{multithread_perf}
\end{figure}