# Changeset 1412

Timestamp:
Aug 31, 2011, 6:40:45 PM
Message:

edits

File:
1 edited

\section{Multithreaded Parabix} \label{section:multithread}

Even if an application is infinitely parallelizable and thread synchronization costs are non-existent, all applications are constrained by the power and energy overheads incurred when utilizing multiple cores: as more cores are put to work, a proportional increase in power occurs. Unfortunately, due to the runtime overheads associated with thread management and data synchronization, it is very hard to obtain corresponding improvements in performance, resulting in increased energy costs. Parabix-XML can improve performance and reduce energy consumption by improving the overall computation efficiency. However, up to this point, we restricted Parabix-XML to a single core. In this section, we discuss our parallelized version of Parabix-XML to study the effects of thread-level parallelism in conjunction with Parabix-XML's data parallelism.
The typical approach to handling data parallelism with multiple threads involves partitioning data uniformly across the threads. However, XML parsing is inherently sequential, which makes it difficult to partition the data. Several attempts have been made to address this problem using a preparsing phase to help determine the tree structure and to partition the XML document accordingly~\cite{dataparallel}. Another approach involved speculatively partitioning the data~\cite{Shah:2009}, but this introduced a significant level of complexity into the overall logic of the program.

\begin{table*}[!h]
\caption{Data Structure Flow / Dependencies}
\label{pass_structure}
\centering
\begin{tabular}{|l|@{~}c@{~}|@{~}c@{~}|@{~}c@{~}|@{~}c@{~}|@{~}c@{~}|@{~}c@{~}|@{~}c@{~}|@{~}c@{~}|@{~}c@{~}|@{~}c@{~}|}
\hline
 & data\_buffer & basis\_bits & u8 & lex & scope & ctCDPI & ref & tag & xml\_names & err\_streams \\ \hline
size (B) & 128 & 128 & 496 & 448 & 80 & 176 & 112 & 176 & 16 & 112 \\ \hline
latency (C/B) & & & & & & & & & & \\ \hline
\end{tabular}
\end{table*}
In contrast to those methods, we adopted a parallelism strategy that requires neither speculation nor pre-parsing. As described in Section~\ref{section:parser}, Parabix-XML consists of multiple passes that, on every chunk of input data, interact with each other in sequence with no data movement from later to earlier passes. This fits well into the mold of pipeline parallelism. We partitioned Parabix-XML into four stages and assigned a core to each stage. One of the key challenges was to determine which passes should be grouped together into a stage. By analyzing the latency and data dependencies of each of the passes in the single-threaded version of Parabix-XML (Column 1 in Table~\ref{pass_structure}), we assigned the passes to stages such that the stage latencies of the overall pipeline were balanced, providing maximal throughput.

The interface between stages is implemented using a ring buffer, where each entry consists of all ten data structures for one segment, as listed in Table~\ref{pass_structure}. Each pipeline stage $S$ maintains the index of the buffer entry ($I_S$) that is being processed. Before processing the next buffer frame, a stage checks whether the previous stage is done by spinning on $I_{S-1}$ (Stage $S-1$'s buffer entry index). In commodity multicore chips, all threads typically share the last-level cache. If we let the faster pipeline stages run ahead, the data they process will increase contention on the shared cache.
To prevent this, we limit how far the faster pipeline stages can run ahead by controlling the overall size of the ring buffer. Whenever a faster stage runs ahead, it will effectively cause the ring buffer to fill up and force that stage to stall.

Figure~\ref{multithread_perf} demonstrates the performance improvement achieved by pipelined Parabix-XML in comparison with the single-threaded version. The 4-threaded version is $\simeq2\times$ faster than the single-threaded version and achieves $\simeq2.7$ cycles per input byte by exploiting the SIMD units of all \SB{}'s cores. This performance approaches the 1 cycle per byte performance of custom hardware solutions~\cite{DaiNiZhu2010}. Parabix demonstrates the potential to enable an entire new class of applications, text processing, to exploit multicores.

Figure~\ref{pipeline_power} shows the average power consumed by the multithreaded Parabix-XML. Overall, as expected, the power consumption increases in proportion to the number of active cores. Note that the increase is not linear, since shared units such as last-level caches consume active power even if only one core is active.
Perhaps more interestingly, there was a corresponding reduction in execution time, which leads to a similar energy consumption as the single-threaded version (in some cases marginally less energy, as shown for soap.xml).

\begin{figure}
\subfigure[Performance (Cycles / kB)]{
\includegraphics[width=0.32\textwidth]{plots/pipeline_performance.pdf}
\label{pipeline_performance}