# Changeset 1362

Timestamp:
Aug 23, 2011, 6:38:17 PM
Message:

first parse. Pipeline stage

Location:
docs/HPCA2012
Files:
2 edited

\section{Multithreaded Parabix}
\label{section:multithread}

The general challenge with boosting performance through multicore parallelism is the power and energy overheads. Putting more cores to work on a chip results in a proportionate increase in power. Unfortunately, due to the runtime overheads associated with thread management and data synchronization, it is very hard to obtain corresponding improvements in performance, resulting in increased energy costs. Parabix, which exploits intra-core fine-grain SIMD parallelism, improves performance and achieves corresponding improvements in energy by improving overall compute efficiency. However, Parabix is restricted to the resources available within a single core. In this section we parallelize the Parabix XML parser by hand to study the effects of thread-level parallelism in conjunction with Parabix's data parallelism.

A typical approach to exploiting data parallelism with multiple threads requires partitioning the data uniformly across the threads. However, XML parsing inherently moves through the data in sequence, which creates data dependencies between the different phases and makes it hard to partition the file. A few attempts have been made to address this problem using a pre-parsing phase that determines the tree structure of the XML document so that it can be partitioned~\cite{dataparallel}. Other approaches speculatively partition the data~\cite{Shah:2009}, but this introduces significant complexity into the overall logic of the program.
We adopt a contrasting approach to parallelizing the Parabix XML parser. As described in Section~\ref{}, Parabix consists of multiple passes that operate on every chunk of input data, and these passes interact in sequence with no data flowing from later to earlier passes. This fits well into the mold of pipeline parallelism, so we partition the overall parser into several stages and assign a core to each stage. One of the key challenges is to decide which passes should be grouped into each stage. We analyzed the latency of each pass in the single-threaded version of the Parabix XML parser (Column 1 in Table~\ref{pass_structure}) and assigned the passes to stages such that the stage latencies of the overall pipeline are balanced, resulting in maximal throughput.

The interface between stages is implemented using a ring buffer, where each entry consists of all ten data structures for one segment, as listed in Table~\ref{pass_structure}. Each pipeline stage $S$ maintains the index ($I_S$) of the buffer entry it is processing. Before processing the next buffer entry, stage $S$ checks whether the previous stage is done by spinning on $I_{S-1}$ (stage $S-1$'s index). In commodity multicore chips, all threads typically share the last-level cache; if faster pipeline stages are allowed to run far ahead, the data they produce increases contention in this shared cache. To prevent this, we bound how far the faster stages can run ahead by limiting the overall size of the ring buffer: a faster stage that runs ahead fills up the ring buffer and implicitly stalls.

\begin{table*}[t]
\end{table*}
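The ring-buffer handoff between stages can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the Parabix implementation: the names (`NUM_STAGES`, `RING_SIZE`, `segment_t`, `run_stage`) are hypothetical, thread creation is omitted, and in practice each stage would run on its own core.

```c
/* Sketch of per-stage ring-buffer synchronization (illustrative names,
   not from the Parabix source). Stage S may process entry i only after
   stage S-1 has published completion of entry i; bounding the ring size
   bounds how far a fast stage can run ahead of the slowest stage. */
#include <stdatomic.h>

#define NUM_STAGES 4
#define RING_SIZE  4  /* limits run-ahead, and thus shared-cache pressure */

typedef struct { char data[4096]; } segment_t;

static segment_t ring[RING_SIZE];           /* one entry per in-flight segment */
static atomic_long stage_index[NUM_STAGES]; /* I_S: entries completed by stage S */

/* Body of pipeline stage s; each stage runs this loop on its own thread. */
void run_stage(int s, long total_segments, void (*process)(segment_t *)) {
    for (long i = 0; i < total_segments; i++) {
        if (s > 0)
            while (atomic_load(&stage_index[s - 1]) <= i)
                ; /* spin until the previous stage finishes entry i
                     (this busy-wait is the "stall time" in the text) */
        while (i - atomic_load(&stage_index[NUM_STAGES - 1]) >= RING_SIZE)
            ; /* ring full: a fast stage stalls instead of lapping slower ones */
        process(&ring[i % RING_SIZE]);
        atomic_store(&stage_index[s], i + 1); /* publish progress (I_S) */
    }
}
```

Because a stage only spins on its predecessor's index and on the final stage's index, no locks are needed; the ring size alone dictates the maximum number of segments in flight.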
Figure~\ref{multithread_perf} shows the performance improvement achieved by the pipelined Parabix in comparison with the single-threaded version. The multithreaded version is $\simeq$2$\times$ faster and achieves $\simeq$2.7 cycles per input byte by exploiting the SIMD units of all of \SB{}'s cores. This approaches the performance of custom hardware solutions and demonstrates the potential to enable an entire new class of applications, text processing, to exploit multicores.

Figure~\ref{pipeline_power} shows the average power consumed by the multithreaded Parabix. Overall, as expected, the power consumption of the multithreaded Parabix increases in proportion to the number of active cores. The increase is not linear, since shared units such as the last-level cache consume active power even when only a single core is active. Perhaps more interestingly, we achieve a corresponding reduction in execution time, which leads to the same energy consumption as the single-threaded version: since energy is the product of power and execution time, roughly doubling the power while halving the runtime leaves the energy unchanged. In some cases marginally less energy is consumed, as shown in Figure~\ref{pipeline_energy} for soap.xml. \begin{figure}