# Changeset 1774 for docs/HPCA2012/final_ieee/09-pipeline.tex

Timestamp:
Dec 13, 2011, 4:50:42 PM
Message:

minor changes

File:
1 edited

r1743

Even if an application is infinitely parallelizable and thread synchronization costs are non-existent, all applications are constrained by the power and energy overheads incurred when utilizing multiple cores: as more cores are put to work, a proportional increase in power occurs. The typical approach to handling data parallelism with multiple threads involves partitioning data uniformly across the threads. However, XML parsing is inherently sequential, which makes it difficult to partition the data. Several attempts have been made to address this problem, for example by using a preparsing phase to help determine the tree structure and to partition the XML document accordingly~\cite{dataparallel}. Another approach involved speculatively partitioning the data~\cite{Shah:2009}. We partitioned Parabix-XML into four stages and assigned a core to each stage. One of the key challenges was to determine which passes should be grouped together. We analyzed the latency and data dependencies of each of the passes in the single-threaded version of Parabix (Column 3 in Table~\ref{pass_structure}), and assigned the passes to stages to maximize throughput. The stages communicate through a ring buffer of fixed overall size. Whenever a faster stage runs ahead, it will effectively cause the ring buffer to fill up and force that stage to stall.
Experiments show that six entries of the circular buffer give the best performance. The 4-threaded version is $\simeq2\times$ faster than the single-threaded version and achieves $\simeq2.7$ cycles per input byte by exploiting the SIMD units of all \SB{}'s cores. This performance approaches the 1 cycle per byte performance of custom hardware solutions~\cite{DaiNiZhu2010}.