\section{Multithreaded Parabix}
\label{section:multithread}
Even if an application is infinitely parallelizable and thread
synchronization costs are non-existent, all applications are constrained by
the power and energy overheads incurred when utilizing multiple cores:
as more cores are put to work, a proportional increase in power occurs.
Unfortunately, due to the runtime overheads associated with
thread management and data synchronization, it is very hard to obtain
corresponding improvements in performance, and the net result is
increased energy cost. Parabix-XML can improve performance and reduce
energy consumption by improving overall computational efficiency.
However, up to this point we have restricted Parabix-XML to a
single core. In this section, we discuss our parallelized version of Parabix-XML,
built to study the effects of thread-level parallelism in
conjunction with Parabix-XML's data parallelism.

The typical approach to handling data parallelism with multiple threads
involves partitioning data uniformly across the threads. However, XML
parsing is inherently sequential, which makes it difficult to
partition the data. Several attempts have been made to address this
problem using a preparsing phase to help determine the tree structure
and to partition the XML document accordingly~\cite{dataparallel}.
Another approach involved speculatively partitioning the data~\cite{Shah:2009}, but
this introduced a significant level of complexity into the overall logic of the program.

\begin{table*}[htbp]
{
\centering
\footnotesize
\begin{center}
\begin{tabular}{|@{~}c@{~}|@{~}c@{~}|@{~}c@{~}|c@{~}|@{~}c@{~}|@{~}c@{~}|@{~}c@{~}|@{~}c@{~}|@{~}c@{~}|@{~}c@{~}|@{~}c@{~}|@{~}c@{~}|@{~}c@{~}|}
\hline
        &      & & \multicolumn{10}{|c|}{Data Structure Flow / Dependencies}\\ \hline
        &      &                & data\_buffer& basis\_bits & u8   & lex   & scope & ctCDPI & ref    & tag    & xml\_names & err\_streams\\ \hline
        & latency (C/B)     & size (B)       & 128         & 128         & 496  & 448   & 80    & 176    & 112    & 176    & 16         & 112           \\ \hline
Stage1  & 1.97 &read\_data      & write       &             &      &       &       &        &        &        &            &               \\ 
        &      &transposition   & read        & write       &      &       &       &        &        &        &            &               \\ 
        &      &classification  &             & read        &      & write &       &        &        &        &            &               \\ \hline
Stage2  & 1.22 &validate\_u8    &             & read        & write&       &       &        &        &        &            &               \\ 
        &      &gen\_scope      &             &             &      & read  & write &        &        &        &            &               \\ 
        &      &parse\_CtCDPI   &             &             &      & read  & read  & write  &        &        &            & write         \\ 
        &      &parse\_ref      &             &             &      & read  & read  & read   & write  &        &            &               \\ \hline
Stage3  & 2.03 &parse\_tag      &             &             &      & read  & read  & read   &        & write  &            &               \\ 
        &      &validate\_name  &             &             & read & read  &       & read   & read   & read   & write      & write         \\ 
        &      &gen\_check      &             &             & read & read  & read  & read   &        & read   & read       & write         \\ \hline
Stage4  & 1.32 &postprocessing  & read        &             &      & read  &       & read   & read   &        &            & read          \\ \hline
\end{tabular}
\end{center}
\caption{Relationship between Each Pass and Data Structures}
\label{pass_structure}
}
\end{table*}

In contrast to those methods, we adopted a parallelism strategy that requires
neither speculation nor pre-parsing.
As described in Section~\ref{section:parser}, Parabix-XML consists of multiple
passes that, on every chunk of input data, interact with each other
in sequence with no data movement from later to earlier
passes. This fits well into the mold of pipeline parallelism. We
partitioned Parabix-XML into four stages and assigned a core to
each stage. One of the key challenges was to determine which passes
should be grouped together. By analyzing the latency and data dependencies of each of
the passes in the single-threaded version of Parabix-XML
(the latency column in Table~\ref{pass_structure}), we assigned the passes
to stages so as to provide maximal throughput.
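
The grouping objective can be made concrete: in a linear pipeline, throughput is bounded by the slowest stage, so the ordered passes should be split into four contiguous groups that minimize the largest per-stage latency. The following brute-force search is a sketch of that selection under our own illustrative naming (\texttt{best\_partition} and the latency values in the test are ours, not code or data from Parabix-XML); note that in practice data dependencies between passes further constrain the grouping.

```c
#include <float.h>

/* Split n ordered pass latencies (cycles/byte) into 4 contiguous
 * stages, minimizing the bottleneck (maximum per-stage) latency.
 * cuts[] receives the index of the first pass of stages 2, 3 and 4.
 * Brute force is acceptable here: the parser has only eleven passes. */
double best_partition(const double lat[], int n, int cuts[3]) {
    double p[n + 1];                       /* prefix sums of latencies */
    p[0] = 0.0;
    for (int i = 0; i < n; i++)
        p[i + 1] = p[i] + lat[i];

    double best = DBL_MAX;
    for (int a = 1; a <= n - 3; a++)
        for (int b = a + 1; b <= n - 2; b++)
            for (int c = b + 1; c <= n - 1; c++) {
                double m = p[a];                      /* stage 1 */
                if (p[b] - p[a] > m) m = p[b] - p[a]; /* stage 2 */
                if (p[c] - p[b] > m) m = p[c] - p[b]; /* stage 3 */
                if (p[n] - p[c] > m) m = p[n] - p[c]; /* stage 4 */
                if (m < best) {
                    best = m;
                    cuts[0] = a; cuts[1] = b; cuts[2] = c;
                }
            }
    return best;
}
```

For the grouping in Table~\ref{pass_structure}, the bottleneck is Stage 3 at 2.03 cycles per byte, which caps the pipeline's steady-state throughput.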
The interface between stages is implemented using a ring buffer, where
each entry consists of all ten data structures for one segment, as
listed in Table~\ref{pass_structure}.  Each pipeline stage $S$ maintains
the index of the buffer entry ($I_S$) that it is currently processing. Before
processing the next buffer entry, each stage checks whether the previous stage
is done by spinning on $I_{S-1}$ (stage $S-1$'s entry index). In
commodity multicore chips, all threads typically share the last-level
cache. If we let the faster pipeline stages run ahead, the data they
process will increase contention in the shared cache. To prevent this,
we limit how far the faster pipeline stages can run ahead by
controlling the overall size of the ring buffer. Whenever a faster stage
runs ahead, it will effectively cause the ring buffer to fill up and
force that stage to stall. Figure~\ref{circular_buffer} shows the performance
for different numbers of entries in the circular buffer, where
6 entries gives the best performance.
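
This synchronization scheme can be sketched as follows; the names (\texttt{Segment}, \texttt{run\_stage}) and the use of C11 atomics are our own illustrative assumptions, not code from the Parabix-XML source. Each stage publishes its completion count with a release store and spins with acquire loads on its predecessor's counter; the first stage additionally spins on the last stage's counter so it cannot overwrite a ring slot that is still in use, which is what bounds run-ahead to the ring size.

```c
#include <stdatomic.h>
#include <stddef.h>

enum { RING_ENTRIES = 6, STAGES = 4 };   /* 6 entries performed best */

/* Hypothetical per-segment record standing in for the ten
 * per-segment data structures. */
typedef struct {
    char data_buffer[128];
    char basis_bits[128];
    int  tag;              /* placeholder for the remaining streams */
} Segment;

static Segment ring[RING_ENTRIES];

/* I[S] counts the segments stage S has completed; each counter is
 * written by exactly one thread and only ever increases, so stage S
 * may process segment n as soon as I[S-1] > n. */
static atomic_size_t I[STAGES];

static void spin_until(atomic_size_t *ctr, size_t n) {
    while (atomic_load_explicit(ctr, memory_order_acquire) < n)
        ;                  /* busy-wait until the counter reaches n */
}

/* Worker body for pipeline stage s, applying its group of passes
 * (`pass`) to `total` consecutive segments. */
void run_stage(int s, size_t total, void (*pass)(Segment *)) {
    for (size_t n = 0; n < total; n++) {
        if (s > 0)                        /* wait for producer stage */
            spin_until(&I[s - 1], n + 1);
        else if (n >= RING_ENTRIES)       /* wait until slot is drained */
            spin_until(&I[STAGES - 1], n - RING_ENTRIES + 1);
        pass(&ring[n % RING_ENTRIES]);
        atomic_store_explicit(&I[s], n + 1, memory_order_release);
    }
}
```

Because the counters are single-writer and monotonic, no locks are needed, and shrinking \texttt{RING\_ENTRIES} directly tightens the cache-contention control described above.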

\begin{figure}[htbp]
\includegraphics[width=0.5\textwidth]{plots/circular_buffer.pdf}
\caption{Performance (CPU cycles per kB) for varying circular buffer sizes}
\label{circular_buffer}
\end{figure}

Figure~\ref{multithread_perf} demonstrates the performance improvement
achieved by pipelined Parabix-XML in comparison with the
single-threaded version.  The 4-threaded version is $\simeq2\times$
faster than the single-threaded version and achieves
$\simeq2.7$ cycles per input byte by exploiting the SIMD units of all of
\SB{}'s cores.  This performance approaches the 1 cycle per byte
performance of custom hardware solutions~\cite{DaiNiZhu2010}. Parabix
demonstrates the potential to enable an entirely new class of
applications, text processing, to exploit multicores.

Figure~\ref{pipeline_power} shows the average power consumed by the
multithreaded Parabix. Overall, as expected, power consumption
increases in proportion to the number of
active cores. Note that the increase is not linear, since shared units
such as the last-level cache consume active power even if only one core
is active. Perhaps more interestingly, the reduction in execution time leads to an energy consumption (see Figure~\ref{pipeline_energy}) similar to that of the single-threaded execution (in some cases marginally less, as shown for soap.xml).

\begin{figure*}[htbp]
\subfigure[Performance (Cycles / kB)]{
\includegraphics[width=0.32\textwidth]{plots/pipeline_performance.pdf}
\label{pipeline_performance}
}
\subfigure[Avg. Power Consumption (Watts)]{
\includegraphics[width=0.32\textwidth]{plots/pipeline_power.pdf}
\label{pipeline_power}
}
\subfigure[Avg. Energy Consumption (nJ / Byte)]{
\includegraphics[width=0.32\textwidth]{plots/pipeline_energy.pdf}
\label{pipeline_energy}
}
\caption{Multithreaded Parabix}
\label{multithread_perf}
\end{figure*}