\section{Multithreaded Parabix}
\label{section:multithread}
Even if an application is infinitely parallelizable and thread
synchronization costs are non-existent, all applications are constrained
by the power and energy overheads incurred when utilizing multiple
cores: as more cores are put to work, power consumption rises
proportionally. Unfortunately, the runtime overheads of thread
management and data synchronization make it very hard to obtain
corresponding improvements in performance; the net result is increased
energy cost. Parabix-XML can improve performance and reduce energy
consumption by increasing the overall efficiency of the computation.
In this section, we discuss our parallelized version of Parabix-XML,
built to study the effects of thread-level parallelism in conjunction
with Parabix-XML's data parallelism.

The typical approach to handling data parallelism with multiple threads
involves partitioning data uniformly across the threads. However, XML
parsing is inherently sequential, which makes it difficult to partition
the data. Several attempts have been made to address this problem. One
approach uses a preparsing phase to help determine the tree structure
and to partition the XML document accordingly~\cite{dataparallel}.
Another approach speculatively partitions the data~\cite{ParaDOM2009},
but this introduces a significant level of complexity into the overall
logic of the program.

\begin{table}[htbp]
\centering
\footnotesize
\begin{tabular}{|@{~}c@{~}|@{~}c@{~}|@{~}c@{~}|}
\hline
        & Functions                                           & Latency (cycles/byte) \\ \hline
Stage 1 & read\_data, transposition, classification           & 1.97 \\ \hline
Stage 2 & validate\_u8, gen\_scope, parse\_CtCDPI, parse\_ref & 1.22 \\ \hline
Stage 3 & parse\_tag, validate\_name, gen\_check              & 2.03 \\ \hline
Stage 4 & postprocessing                                      & 1.32 \\ \hline
\end{tabular}
\caption{Stage Division}
\label{pass_structure}
\end{table}

In contrast to those methods, we adopted a parallelism strategy that
requires neither speculation nor pre-parsing. As described in
Section~\ref{section:parser}, Parabix-XML consists of multiple passes
that operate on every chunk of input data and interact with each other
in sequence, with no data movement from later to earlier passes. This
fits well into the mold of pipeline parallelism. We partitioned
Parabix-XML into four stages and assigned a core to each stage. One of
the key challenges was to determine which passes should be grouped
together. We analyzed the latency and data dependencies of each of the
passes in the single-threaded version of Parabix (Column 3 in
Table~\ref{pass_structure}) and assigned the passes to stages so as to
maximize throughput.
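
To make this grouping concrete, the sketch below arranges the passes of
Table~\ref{pass_structure} into four stage functions. Only the pass
names come from the table; the \texttt{Segment} type, the signatures,
and the empty stub bodies are assumptions made for illustration, not
the actual Parabix interfaces.

\begin{verbatim}
// Illustrative grouping of the Parabix passes into pipeline
// stages. Pass names follow Table 1; Segment, the signatures,
// and the empty bodies are hypothetical stand-ins for the
// parallel bit stream structures of one input segment.
struct Segment { /* bit streams and markers for one segment */ };

// Stub passes, named after Table 1.
void read_data(Segment&) {}      void transposition(Segment&) {}
void classification(Segment&) {} void validate_u8(Segment&) {}
void gen_scope(Segment&) {}      void parse_CtCDPI(Segment&) {}
void parse_ref(Segment&) {}      void parse_tag(Segment&) {}
void validate_name(Segment&) {}  void gen_check(Segment&) {}
void postprocessing(Segment&) {}

// Each stage applies its assigned passes, in order, to a segment.
void stage1(Segment& s) {
  read_data(s); transposition(s); classification(s);
}
void stage2(Segment& s) {
  validate_u8(s); gen_scope(s); parse_CtCDPI(s); parse_ref(s);
}
void stage3(Segment& s) {
  parse_tag(s); validate_name(s); gen_check(s);
}
void stage4(Segment& s) {
  postprocessing(s);
}
\end{verbatim}
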
The interface between stages is implemented using a ring buffer, where
each entry consists of all ten data structures for one segment. Each
pipeline stage $S$ maintains the index $I_S$ of the buffer entry it is
currently processing. Before processing the next buffer entry, a stage
checks whether the previous stage has finished with it by spinning on
$I_{S-1}$ (Stage $S-1$'s index). In commodity multicore chips, all
threads typically share the last-level cache. If the faster pipeline
stages were allowed to run ahead, the data they process would increase
contention for this shared cache. To prevent this, we limit how far the
faster stages can run ahead by controlling the overall size of the ring
buffer: whenever a faster stage runs ahead, the ring buffer fills up
and forces that stage to stall. Experiments show that a ring buffer of
six entries gives the best performance.
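
The following is a minimal sketch of this synchronization scheme using
C++11 atomics and spin-waiting. The monotonic-counter encoding, the
\texttt{RING} and \texttt{SEGMENTS} constants, and the \texttt{process}
callback are our illustrative choices, not the actual Parabix
implementation.

\begin{verbatim}
#include <atomic>
#include <thread>
#include <vector>

constexpr int  RING     = 6;    // six entries performed best
constexpr long SEGMENTS = 100;  // arbitrary input length

struct Segment { /* the ten per-segment data structures */ };
Segment ring[RING];

// I[s] mirrors the paper's I_S: the number of segments stage s
// has completed. Segment n occupies ring slot n % RING; stage s
// may start segment n only once I[s-1] > n.
std::atomic<long> I[4] = {0, 0, 0, 0};  // C++17 initialization

void run_stage(int s, void (*process)(Segment&)) {
  for (long n = 0; n < SEGMENTS; ++n) {
    if (s > 0) {  // spin until the previous stage finishes n
      while (I[s - 1].load(std::memory_order_acquire) <= n) {}
    } else {      // first stage stalls while the ring is full,
                  // i.e., until the last stage frees slot n % RING
      while (n - I[3].load(std::memory_order_acquire) >= RING) {}
    }
    process(ring[n % RING]);
    I[s].store(n + 1, std::memory_order_release);
  }
}

void dummy_pass(Segment&) { /* this stage's passes (Table 1) */ }

int main() {
  std::vector<std::thread> pool;
  for (int s = 0; s < 4; ++s)
    pool.emplace_back(run_stage, s, dummy_pass);
  for (auto& t : pool) t.join();
}
\end{verbatim}

Sizing the ring trades stall frequency against shared-cache pressure:
a larger ring lets fast stages run further ahead, while a smaller one
makes them stall more often, which is one way to read the six-entry
optimum reported above.
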
Figure~\ref{multithread_perf} demonstrates the performance improvement
achieved by pipelined Parabix-XML over the single-threaded version. The
4-threaded version is $\simeq2\times$ faster than the single-threaded
version and achieves $\simeq2.7$ cycles per input byte by exploiting
the SIMD units of all of \SB{}'s cores. This performance approaches the
1 cycle per byte performance of custom hardware
solutions~\cite{DaiNiZhu2010}. Parabix demonstrates the potential to
enable an entirely new class of applications, text processing, to
exploit multicores.

Figure~\ref{multithread_perf} also shows the average power consumed by
multithreaded Parabix. Overall, as expected, power consumption increases
with the number of active cores. The increase is not linear, however,
since shared units such as the last-level cache consume active power
even when only one core is active. Perhaps more interestingly, the
reduction in execution time means that the total energy consumption is
similar to that of the single-threaded execution (and in some cases
marginally less, e.g., soap.xml).
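
To spell out the arithmetic: energy is average power times execution
time, so the measured $\simeq2\times$ speedup offsets a roughly twofold
rise in power. (The twofold power figure here is an assumption for
illustration; Figure~\ref{multithread_perf} shows the measured values.)

\[
E = P \cdot t, \qquad
\frac{E_{\mathrm{pipelined}}}{E_{\mathrm{single}}}
  = \frac{P_{\mathrm{pipelined}}}{P_{\mathrm{single}}}
    \cdot \frac{t_{\mathrm{pipelined}}}{t_{\mathrm{single}}}
  \approx 2 \cdot \frac{1}{2} = 1 .
\]
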
\begin{figure}[htbp]
\centering
\includegraphics[width=0.50\textwidth]{plots/pipeline.pdf}
\caption{Performance and Average Power of Multithreaded Parabix}
\label{multithread_perf}
\end{figure}