%\section{Leveraging SIMD Parallelism for Multicore: Pipeline Parallelism}

As discussed in section \ref{background:xerces}, Xerces can be considered a complex finite-state machine.
As an application class, finite-state machines are considered very difficult to parallelize
and have been termed ``embarrassingly sequential'' \cite{Asanovic:EECS-2006-183}.
However, \icXML{} is designed to organize processing into logical layers that
are separable.  In particular, layers within the \PS{} are designed to operate
over significant segments of input data before passing their outputs on for
subsequent processing.  This fits well into the general model of pipeline
parallelism, in which each thread is in charge of a single module or group
of modules.

The most straightforward division of work in \icXML{} is to separate
the \PS{} and the \MP{} into distinct logical layers in a two-stage pipeline.
In this case, the \PS{} thread $T_1$ reads 16k of XML input $I$ at a time, produces the
content, symbol and URI streams, and stores them in a pre-allocated shared data structure $S$.
The \MP{} thread $T_2$ consumes $S$, performs well-formedness and grammar-based validation,
and then provides the parsed XML data to the application through the Xerces API.
The shared data structure is implemented as a ring buffer,
in which every entry contains an independent set of data streams.
In the examples of Figures \ref{threads_timeline1} and \ref{threads_timeline2}, the ring buffer has four entries.
A lock-free mechanism ensures that each entry can be read or written by only one thread at a time.
In Figure \ref{threads_timeline1} the processing time of $T_1$ is longer than that of $T_2$;
thus $T_2$ always waits for $T_1$ to write to the shared memory.
Figure \ref{threads_timeline2} illustrates the opposite scenario, in which $T_1$ is faster
and must wait for $T_2$ to finish reading the shared data before it can reuse the memory space.

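To make the hand-off concrete, the following minimal sketch shows a
single-producer, single-consumer ring buffer of the kind described above.
The names (\verb|StreamSet|, \verb|RingBuffer|) are hypothetical placeholders
rather than actual \icXML{} classes; the sketch simply illustrates how
release/acquire ordering lets each entry be owned by at most one thread at a
time without locks.

\begin{verbatim}
#include <atomic>
#include <array>
#include <cstddef>

struct StreamSet {
  // Content, symbol and URI stream buffers for one 16k input segment
  // (placeholder; the actual icXML stream structures differ).
};

template <std::size_t N = 4>               // four entries, as in the figures
class RingBuffer {
  std::array<StreamSet, N> entries;
  std::atomic<std::size_t> writeIndex{0};  // advanced only by T1 (Parabix Subsystem)
  std::atomic<std::size_t> readIndex{0};   // advanced only by T2 (Markup Processor)
public:
  // T1: wait until an entry is free, then obtain it for writing.
  StreamSet & acquireForWrite() {
    while (writeIndex.load(std::memory_order_relaxed)
           - readIndex.load(std::memory_order_acquire) == N) { }  // spin: full
    return entries[writeIndex.load(std::memory_order_relaxed) % N];
  }
  // T1: publish the entry so that T2 may read it.
  void publish() { writeIndex.fetch_add(1, std::memory_order_release); }

  // T2: wait until an entry has been published, then obtain it for reading.
  const StreamSet & acquireForRead() {
    while (readIndex.load(std::memory_order_relaxed)
           == writeIndex.load(std::memory_order_acquire)) { }     // spin: empty
    return entries[readIndex.load(std::memory_order_relaxed) % N];
  }
  // T2: release the entry so that T1 may reuse its memory.
  void release() { readIndex.fetch_add(1, std::memory_order_release); }
};
\end{verbatim}
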
\begin{figure}
\subfigure[]{
\includegraphics[width=0.48\textwidth]{plots/threads_timeline1.pdf}
\label{threads_timeline1}
}
\hfill
\subfigure[]{
\includegraphics[width=0.48\textwidth]{plots/threads_timeline2.pdf}
\label{threads_timeline2}
}
\caption{Thread Balance in Two-Stage Pipelines}
\end{figure}

Overall, our design assumption is that an accelerated Xerces parser will be
most significant for applications that themselves perform substantial
processing on the parsed XML data delivered.  Our design is intended for
applications spanning a range between two design points.  At the first
design point, the XML parsing cost handled by the
\PS{} dominates at 67\% of the overall
cost, with the cost of application processing (including the driver logic
within the \MP{}) still being quite significant
at 33\%.  The second is almost the reverse scenario: the cost of application processing
dominates at 60\% of the overall cost, while the cost of parsing represents
an overhead of 40\%.

Our design is also predicated on the goal of using the Parabix
framework to achieve a 50\% to 100\% improvement
in the parsing engine itself.  Our best-case scenario is
a 100\% improvement of the \PS{} for the design point in which
XML parsing dominates at 67\% of the total application cost.
In this case, single-threaded \icXML{} should achieve a 50\% speedup,
so that the total application cost reduces to 67\% of the original.
However, with our two-stage pipeline model, our ideal scenario
gives us two well-balanced threads, each performing about 33\% of the
original work.  In this case, Amdahl's law predicts that
we could expect up to a 3X speedup, at best.

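Normalizing the end-to-end Xerces cost to 1, the arithmetic behind these
figures can be made explicit.  Writing $t_{\mathrm{single}}$ and
$t_{\mathrm{pipelined}}$ for the single-threaded and two-stage pipelined
\icXML{} processing times,
\[
t_{\mathrm{single}} = \frac{0.67}{2} + 0.33 \approx 0.67,
\qquad
t_{\mathrm{pipelined}} = \max\!\left(\frac{0.67}{2},\, 0.33\right) \approx 0.33,
\]
giving speedups of roughly $1/0.67 \approx 1.5$ and $1/0.33 \approx 3$, respectively.
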
At the other extreme of our design range, we consider an application
in which the core parsing cost is 40\%.  Assuming the 2X speedup of
the \PS{} over the corresponding Xerces core, single-threaded
\icXML{} delivers a 25\% speedup.  However, the most significant
aspect of our two-stage multithreaded design then becomes the
ability to hide the entire latency of parsing within the serial time
required by the application.  In this case, we achieve
an overall speedup in processing time of 1.67X.

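The corresponding arithmetic for this design point, with a parsing cost of
0.40 and an application cost of 0.60, is
\[
t_{\mathrm{single}} = \frac{0.40}{2} + 0.60 = 0.80,
\qquad
t_{\mathrm{pipelined}} = \max\!\left(\frac{0.40}{2},\, 0.60\right) = 0.60,
\]
yielding speedups of $1/0.80 = 1.25$ and $1/0.60 \approx 1.67$, respectively.
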
Although the structure of the \PS{} allows division of the work into
several pipeline stages and has been demonstrated to be effective
for four pipeline stages in a research prototype \cite{HPCA2012},
our analysis here suggests that further pipelining of work within
the \PS{} is not worthwhile if the cost of application logic is as little as
33\% of the end-to-end cost using Xerces.  To achieve the benefits of
further parallelization with multicore technology, there would
need to be reductions in the cost of application logic that
could match reductions in core parsing cost.

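Viewed in pipeline terms, the stage containing the \MP{} and the application
logic bounds the per-segment processing time from below, no matter how finely
the \PS{} work is subdivided:
\[
t_{\mathrm{pipelined}} \;\geq\; 0.33
\qquad\Longrightarrow\qquad
\mathrm{speedup} \;\leq\; \frac{1}{0.33} \approx 3.
\]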