Timestamp:
Dec 13, 2011, 4:50:42 PM
Author:
lindanl
Message:

minor changes

File:
1 edited

  • docs/HPCA2012/final_ieee/09-pipeline.tex

--- r1743
+++ r1774

@@ -3,5 +3,5 @@
 Even if an application is infinitely parallelizable and thread
 synchronization costs are non-existent, all applications are constrained by
-the power and energy overheads incurred when utilizing multiple cores:
+the power and energy overheads incurred when utilizing multiple cores;
 as more cores are put to work, a proportional increase in power occurs.
 Unfortunately, due to the runtime overheads associated with
     
@@ -17,8 +17,8 @@
 
 The typical approach to handling data parallelism with multiple threads
-involves partitioning data uniformly across the threads. However XML
+involves partitioning data uniformly across the threads. However, XML
 parsing is inherently sequential, which makes it difficult to
 partition the data. Several attempts have been made to address this
-problem using a preparsing phase to help determine the tree structure
+problem. One approach used a preparsing phase to help determine the tree structure
 and to partition the XML document accordingly~\cite{dataparallel}.
 Another approach involved speculatively partitioning the data~\cite{Shah:2009} but
     
@@ -53,8 +53,7 @@
 partitioned Parabix-XML into four stages and assigned a core to
 each stage. One of the key challenges was to determine which passes
-should be grouped together. By analyzing the latency and data dependencies of each of
-the passes in the single-threaded version of Parabix-XML
-(Column 3 in Table~\ref{pass_structure}), and assigned the passes
-to stages such that provided the maximal throughput.
+should be grouped together. We analyzed the latency and data dependencies of each of the passes
+in the single-threaded version of Parabix (Column 3 in Table~\ref{pass_structure}),
+and assigned the passes to stages to maximize throughput.
 
 
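To make the grouping criterion in this hunk concrete: the throughput of a pipeline is limited by its slowest stage, so assigning passes (which must stay in dependency order) amounts to choosing cut points that minimize the bottleneck stage latency. A minimal C++ sketch of that selection, using placeholder per-pass latencies rather than the measured values from Table~\ref{pass_structure}:

    // Sketch only: pick the three cut points that split an ordered list of
    // passes into four contiguous stages, minimizing the slowest stage.
    // The latency numbers below are placeholders, not the paper's data.
    #include <algorithm>
    #include <array>
    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<double> latency = {0.9, 0.4, 0.7, 0.3, 0.6, 0.5, 0.8};
        const int n = static_cast<int>(latency.size());

        // Total latency of the contiguous pass range [lo, hi).
        auto sum = [&](int lo, int hi) {
            double s = 0.0;
            for (int i = lo; i < hi; ++i) s += latency[i];
            return s;
        };

        double best = 1e30;
        std::array<int, 3> cut{};
        // Brute-force all three-way cuts of n ordered passes into
        // four non-empty stages: [0,a) [a,b) [b,c) [c,n).
        for (int a = 1; a < n - 2; ++a)
            for (int b = a + 1; b < n - 1; ++b)
                for (int c = b + 1; c < n; ++c) {
                    double bottleneck = std::max(
                        {sum(0, a), sum(a, b), sum(b, c), sum(c, n)});
                    if (bottleneck < best) { best = bottleneck; cut = {a, b, c}; }
                }
        std::printf("cuts after passes %d, %d, %d; bottleneck %.2f cycles/byte\n",
                    cut[0], cut[1], cut[2], best);
        return 0;
    }

With only a handful of passes and three cut points, exhaustive search is cheap; the same balancing could be read off by hand from the profiled latency column in the table.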
     
@@ -71,5 +70,5 @@
 controlling the overall size of the ring buffer. Whenever a faster stage
 runs ahead, it will effectively cause the ring buffer to fill up and
-force that stage to stall. Experiments show that 6 entries of the
+force that stage to stall. Experiments show that six entries of the
 circular buffer give the best performance.
 
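The stall behavior described in this hunk is a property of any bounded ring buffer between stages. A minimal single-producer/single-consumer sketch, assuming a hypothetical Segment handoff type and busy-wait stalls (the actual Parabix-XML implementation may differ):

    // Sketch of the inter-stage handoff: a bounded SPSC ring buffer in
    // which a faster stage spins (stalls) when the buffer is full or empty.
    // The Segment type is assumed; the 6-entry capacity mirrors the text.
    #include <atomic>
    #include <cstddef>

    struct Segment { const char *data; size_t len; };

    template <size_t N = 6>          // six entries performed best per the text
    class StageRing {
        Segment slots_[N];
        std::atomic<size_t> head_{0};  // next slot the consumer will read
        std::atomic<size_t> tail_{0};  // next slot the producer will write
    public:
        void push(const Segment &s) {  // called by the upstream stage
            size_t t = tail_.load(std::memory_order_relaxed);
            while (t - head_.load(std::memory_order_acquire) == N) {
                // buffer full: the faster upstream stage stalls here
            }
            slots_[t % N] = s;
            tail_.store(t + 1, std::memory_order_release);
        }
        Segment pop() {                // called by the downstream stage
            size_t h = head_.load(std::memory_order_relaxed);
            while (tail_.load(std::memory_order_acquire) == h) {
                // buffer empty: the faster downstream stage stalls here
            }
            Segment s = slots_[h % N];
            head_.store(h + 1, std::memory_order_release);
            return s;
        }
    };

Because push can never advance more than N slots past pop, capping N at six bounds how far a fast stage can run ahead before it stalls, which is exactly the behavior the text describes.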
     
@@ -78,5 +77,5 @@
 single-threaded version.  The 4-threaded version is $\simeq2\times$
 faster than the single-threaded version and achieves
-$\simeq2.7$ cycles per input byte by exploiting SIMD units of all
+$\simeq2.7$ cycles per input byte by exploiting the SIMD units of all
 \SB{}'s cores.  This performance approaches the 1 cycle per byte
 performance of custom hardware solutions~\cite{DaiNiZhu2010}. Parabix