Changeset 2525


Timestamp:
Oct 20, 2012, 2:42:58 PM
Author:
cameron
Message:

Expand multithreading section

Location:
docs/Working/icXML
Files:
2 edited

  • docs/Working/icXML/icxml-main.tex

    r2524 r2525  
    151151\input{arch-errorhandling.tex}
    152152
    153 \section{Leveraging Pipeline Parallelism}
     153\section{Multithreading with Pipeline Parallelism}
    154154\label{multithread}
    155155
  • docs/Working/icXML/multithread.tex

    r2522 r2525  
    77% which naturally enables pipeline parallel processing.
    88
    9 As discussed in section \ref{background:xerces}, Xerces can be considered a complex finite-state machine,
    10 the hardest type of application to parallelize and process efficiently \cite{Asanovic:EECS-2006-183}.
    11 However \icXML{} provides logical layers between modules, which naturally enables pipeline parallel processing.
     9As discussed in section \ref{background:xerces}, Xerces can be considered a complex finite-state machine.
     10As an application class, finite-state machines are considered very difficult to parallelize
     11and have been termed ``embarrassingly sequential'' \cite{Asanovic:EECS-2006-183}.
     12However, \icXML{} is designed to organize processing into logical layers that
     13are separable.   In particular, layers within the \PS{} are designed to operate
     14over significant segments of input data before passing their outputs on for
     15subsequent processing.  This fits well into the general model of pipeline
     16parallelism, in which each thread is in charge of a single module or group
     17of modules.
    1218
    13 In the pipeline model, each thread is in charge of a group of modules.
    14 A straight forward division is to take advantage of the layer between \PS{} and \MP{}.
    15 In this case, the first thread $T_1$ reads 16k of XML input $I$ at a time and produces the
     19The most straightforward division of work in \icXML{} is to separate
     20the \PS{} and the \MP{} into distinct logical layers in a two-stage pipeline.
     21In this case, the \PS{} thread $T_1$ reads 16k of XML input $I$ at a time and produces the
    1622content, symbol and URI streams, then stores them in a pre-allocated shared data structure $S$.
    17 The second thread $T_2$ consumes $S$, performs well-formedness and grammar-based validation
    18 then generates the output $O$.
     23The \MP{} thread $T_2$ consumes $S$, performs well-formedness and grammar-based validation,
     24and then provides parsed XML data to the application through the Xerces API.
    1925The shared data structure is implemented using a ring buffer,
    2026where every entry contains an independent set of data streams.
    21 In the example of Figure \ref{threads_timeline1} and \ref{threads_timeline2}, the ring buffer has four entries.
     27In the examples of Figure \ref{threads_timeline1} and \ref{threads_timeline2}, the ring buffer has four entries.
    2228A lock-free mechanism ensures that each entry can be read or written by only one thread at a time.
    2329In Figure \ref{threads_timeline1} the processing time of $T_1$ is longer than that of $T_2$;
     
    2632and must wait for $T_2$ to finish reading the shared data before it can reuse the memory space.
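
To make the coordination concrete, the following is a minimal sketch
of a single-producer/single-consumer ring buffer of the kind described
above, using C++11 atomics.  All names here (\texttt{StreamSet},
\texttt{RingBuffer} and its methods) are hypothetical illustrations,
not the actual \icXML{} implementation.

\begin{verbatim}
// Sketch only: SPSC ring buffer with four entries, as in the figures.
// StreamSet is a hypothetical placeholder for one entry's pre-allocated
// content, symbol and URI streams.
#include <array>
#include <atomic>
#include <cstddef>

struct StreamSet { /* streams for one 16k segment of input */ };

template <std::size_t N = 4>
class RingBuffer {
    std::array<StreamSet, N> entry;
    std::atomic<std::size_t> head{0};  // next slot T1 (parser) writes
    std::atomic<std::size_t> tail{0};  // next slot T2 (markup) reads
public:
    StreamSet *acquireForWrite() {      // T1: null => buffer full, wait
        std::size_t h = head.load(std::memory_order_relaxed);
        if (h - tail.load(std::memory_order_acquire) == N) return nullptr;
        return &entry[h % N];
    }
    void publish() { head.fetch_add(1, std::memory_order_release); }

    const StreamSet *acquireForRead() { // T2: null => nothing ready
        std::size_t t = tail.load(std::memory_order_relaxed);
        if (head.load(std::memory_order_acquire) == t) return nullptr;
        return &entry[t % N];
    }
    void release() { tail.fetch_add(1, std::memory_order_release); }
};
\end{verbatim}

Because each index is advanced by exactly one thread, no locks are
required; a null return tells the caller to wait, which is exactly the
stall behaviour illustrated in the timelines above.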
    2733
     34\begin{figure}
     35
     36\subfigure[]{
     37\includegraphics[width=0.48\textwidth]{plots/threads_timeline1.pdf}
     38\label{threads_timeline1}
     39}
     40\hfill
     41\subfigure[]{
     42\includegraphics[width=0.48\textwidth]{plots/threads_timeline2.pdf}
     43}
     44\caption{Thread Balance in Two-Stage Pipelines}
     45\label{threads_timeline2}
     46
     47\end{figure}
    2848
    2949% In our pipeline model, each thread is in charge of one module or one group of modules.
     
    4565% and the first thread has to wait for the second thread finishing reading the shared data before it can reuse the memory space.
    4666
    47 To understand the performance improvement that can be achieved by this pipeline model,
    48 first suppose we have an application that spends 66\% of time on parsing when it applies Xerces XML parser
    49 and icXML can reduce the parsing time by half.
    50 According to Amdahl's law, the speedup by using icXML is 1.5x.
    51 When pipeline model is applied to icXML and a separate core is used for parsing,
    52 the entire parsing time can be hidden, thus the application is 3x faster.
    53 However, when parsing time does not dominate the performance,
    54 e.g. 40\% of the total time is used for parsing,
    55 we can only speedup the processing time by 1.67x.
     67Overall, our design assumption is that an accelerated Xerces parser will be
     68most beneficial for applications that themselves perform substantial
     69processing on the parsed XML data delivered.  Our design is intended for
     70a range of applications spanning two design points.   The first
     71design point is one in which XML parsing cost handled by the
     72\PS{} dominates at 67\% of the overall
     73cost, with the cost of application processing (including the driver logic
     74within the \MP{}) still being quite significant
     75at 33\%.   The second is almost the reverse scenario: the cost of application processing
     76dominates at 60\% of the overall cost, while the overall cost of parsing represents
     77an overhead of 40\%.
    5678
    57 \begin{figure}
     79Our design is also predicated on a goal of using the Parabix
     80framework to achieve a 50\% to 100\% improvement
     81in the parsing engine itself.   Our best case scenario is
     82a 100\% improvement of the \PS{} for the design point in which
     83XML parsing dominates at 67\% of the total application cost.
     84In this case, single-threaded \icXML{} should achieve a 50\% speedup
     85so that the total application cost reduces to 67\% of the original. 
     86However, with our two-stage pipeline model, our ideal scenario
     87gives us two well-balanced threads each performing about 33\% of the
     88original work.   In this case, Amdahl's law predicts that
     89we could expect at best a 3X speedup.
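
Making this arithmetic explicit, with parsing at 67\% of the original
cost and a 2X faster \PS{}, Amdahl's law gives
\[ S_{\mathrm{serial}} = \frac{1}{0.33 + 0.67/2} \approx 1.5, \qquad
   S_{\mathrm{pipelined}} = \frac{1}{\max(0.33,\ 0.67/2)} \approx 3, \]
since the elapsed time of a two-stage pipeline is bounded below by its
slower stage.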
    5890
    59 \subfigure[]{
    60 \includegraphics[width=0.48\textwidth]{plots/threads_timeline1.pdf}
    61 \label{threads_timeline1}
    62 }
    63 \hfill
    64 \subfigure[]{
    65 \includegraphics[width=0.48\textwidth]{plots/threads_timeline2.pdf}
    66 }
    67 \caption{}
    68 \label{threads_timeline2}
     91At the other extreme of our design range, we consider an application
     92in which core parsing cost is 40\%.   Assuming the 2X speedup of
     93the \PS{} over the corresponding Xerces core, single-threaded
     94\icXML{} delivers a 25\% speedup.   However, the most significant
     95aspect of our two-stage multithreaded design then becomes the
     96ability to hide the entire latency of parsing within the serial time
     97required by the application.   In this case, we achieve
     98an overall speedup in processing time of 1.67X.
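
The corresponding calculation at this design point is
\[ S_{\mathrm{serial}} = \frac{1}{0.60 + 0.40/2} = 1.25, \qquad
   S_{\mathrm{pipelined}} = \frac{1}{\max(0.60,\ 0.20)} \approx 1.67, \]
with the 20\% parsing stage hidden entirely behind the 60\% of
application work.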
    6999
    70 \end{figure}
     100Although the structure of the \PS{} allows division of the work into
     101several pipeline stages and has been demonstrated to be effective
     102for four pipeline stages in a research prototype \cite{HPCA2012},
     103our analysis here suggests that the further pipelining of work within
     104the \PS{} is not worthwhile if the cost of application logic is as little as
     10533\% of the end-to-end cost using Xerces.  To achieve benefits of
     106further parallelization with multicore technology, there would
     107need to be reductions in the cost of application logic that
     108could match reductions in core parsing cost.
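
This limit again follows from Amdahl's law: if the 33\% of end-to-end
cost attributable to application logic cannot be reduced, then no
further parallelization of parsing can raise the overall speedup beyond
\[ S_{\max} = \frac{1}{0.33} \approx 3. \]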
    71109
    72110% \begin{figure}