# Changeset 2525

Timestamp:
Oct 20, 2012, 2:42:58 PM
Message:

Location:
docs/Working/icXML
Files:
2 edited

As discussed in section \ref{background:xerces}, Xerces can be considered a complex finite-state machine. As an application class, finite-state machines are considered very difficult to parallelize and have been termed ``embarrassingly sequential'' \cite{Asanovic:EECS-2006-183}. However, \icXML{} is designed to organize processing into logical layers that are separable. In particular, layers within the \PS{} are designed to operate over significant segments of input data before passing their outputs on for subsequent processing. This fits well into the general model of pipeline parallelism, in which each thread is in charge of a single module or group of modules. The most straightforward division of work in \icXML{} is to separate the \PS{} and the \MP{} into distinct logical layers of a two-stage pipeline. In this case, the \PS{} thread $T_1$ reads 16k of XML input $I$ at a time, produces the content, symbol and URI streams, and stores them in a pre-allocated shared data structure $S$. The \MP{} thread $T_2$ consumes $S$, performs well-formedness and grammar-based validation, and then provides the parsed XML data to the application through the Xerces API.
The shared data structure is implemented using a ring buffer, in which every entry contains an independent set of data streams. In the examples of Figures \ref{threads_timeline1} and \ref{threads_timeline2}, the ring buffer has four entries. A lock-free mechanism ensures that each entry can be read or written by only one thread at a time. In Figure \ref{threads_timeline1}, the processing time of $T_1$ is longer than that of $T_2$, so $T_2$ must wait for $T_1$ to fill each entry; conversely, when $T_2$ is the slower thread, $T_1$ must wait for $T_2$ to finish reading the shared data before it can reuse the memory space.

\begin{figure}
\subfigure[]{
\includegraphics[width=0.48\textwidth]{plots/threads_timeline1.pdf}
\label{threads_timeline1}
}
\hfill
\subfigure[]{
\includegraphics[width=0.48\textwidth]{plots/threads_timeline2.pdf}
}
\caption{Thread Balance in Two-Stage Pipelines}
\label{threads_timeline2}
\end{figure}

Overall, our design assumption is that an accelerated Xerces parser will be most significant for applications that themselves perform substantial processing on the parsed XML data delivered.
Our design is intended for a range of applications between two design points. The first design point is one in which the XML parsing cost handled by the \PS{} dominates at 67\% of the overall cost, with the cost of application processing (including the driver logic within the \MP{}) still being quite significant at 33\%. The second is almost the reverse scenario: the cost of application processing dominates at 60\% of the overall cost, while the cost of parsing represents an overhead of 40\%.

Our design is also predicated on a goal of using the Parabix framework to achieve a 50\% to 100\% improvement in the parsing engine itself. Our best case scenario is a 100\% improvement of the \PS{} for the design point in which XML parsing dominates at 67\% of the total application cost. In this case, single-threaded \icXML{} should achieve a 50\% speedup, so that the total application cost reduces to 67\% of the original. However, with our two-stage pipeline model, our ideal scenario gives us two well-balanced threads, each performing about 33\% of the original work. In this case, Amdahl's law predicts that we could expect up to a 3X speedup, at best.

At the other extreme of our design range, we consider an application in which the core parsing cost is 40\%. Assuming the 2X speedup of the \PS{} over the corresponding Xerces core, single-threaded \icXML{} delivers a 25\% speedup. However, the most significant aspect of our two-stage multithreaded design then becomes the ability to hide the entire latency of parsing within the serial time required by the application. In this case, we achieve an overall speedup in processing time of 1.67X.

Although the structure of the \PS{} allows division of the work into several pipeline stages and has been demonstrated to be effective for four pipeline stages in a research prototype \cite{HPCA2012}, our analysis here suggests that further pipelining of work within the \PS{} is not worthwhile if the cost of application logic is as little as 33\% of the end-to-end cost using Xerces. To achieve the benefits of further parallelization with multicore technology, there would need to be reductions in the cost of application logic that could match reductions in core parsing cost.
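The design-point arithmetic above can be summarized with Amdahl's law. Writing $p$ for the fraction of the original Xerces run time spent parsing and $k$ for the acceleration of the parsing engine itself (here $k = 2$), a minimal sketch of the calculation:

```latex
% Single-threaded speedup: only the parsing fraction p is accelerated.
\[ S_{\mathit{single}} = \frac{1}{(1-p) + p/k} \]
% Two-stage pipeline: parsing runs concurrently with application work,
% so total time is the longer of the two stages.
\[ S_{\mathit{pipe}} = \frac{1}{\max\!\left(1-p,\; p/k\right)} \]
% Design point 1: p = 2/3, k = 2.
%   S_single = 1/(1/3 + 1/3) = 1.5 (a 50% speedup);
%   S_pipe   = 1/(1/3)       = 3   (two balanced threads of ~33% each).
% Design point 2: p = 0.4, k = 2.
%   S_single = 1/(0.6 + 0.2) = 1.25 (a 25% speedup);
%   S_pipe   = 1/0.6 \approx 1.67   (parsing fully hidden behind the
%                                    60% of serial application time).
```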