Last change on this file since 2506 was 2506, checked in by lindanl, 7 years ago

File size: 4.5 KB
Line
1%\section{Leveraging SIMD Parallelism for Multicore: Pipeline Parallelism}
2\section{Leveraging Pipeline Parallelism}
3% As discussed in section \ref{background:xerces}, Xerces can be considered a complex finite-state machine
4% Finite-state machine belongs to the hardest application class to parallelize and process efficiently
5% among all presented in Berkeley study reports \cite{Asanovic:EECS-2006-183}.
6% However, \icXML{} reconstructs Xerces and provides logical layers between modules,
7% which naturally enables pipeline parallel processing.
8
9As discussed in section \ref{background:xerces}, Xerces can be considered a complex finite-state machine,
10the hardest type of application to parallelize and process efficiently \cite{Asanovic:EECS-2006-183}.
11However \icXML{} provides logical layers between modules, which naturally enables pipeline parallel processing.
12
13In the pipeline model, each thread is in charge of a group of modules.
14A straight forward division is to take advantage of the layer between \PS{} and \MP{}.
15In this case, the first thread $T_1$ reads 16k of XML input $I$ at a time and produces the
16content, symbol and URI streams, then stores them in a pre-allocated shared data structure $S$.
17The second thread $T_2$ consumes $S$, performs well-formedness and grammar-based validation
18then generates the output $O$.
19The shared data structure is implemented using a ring buffer,
20where every entry contains an independent set of data streams.
21In the example of Figure \ref{threads_timeline1} and \ref{threads_timeline2}, the ring buffer has four entries.
22A lock-free mechanism is applied to ensure that each entry can only be read or written by one thread at the same time.
23In Figure \ref{threads_timeline1} the processing time of $T_1$ is longer than $T_2$;
24thus $T_2$ always waits for $T_1$ to write to the shared memory.
25Figure \ref{threads_timeline2} illustrates the scenario in which $T_1$ is faster
26and must wait for $T_2$ to finish reading the shared data before it can reuse the memory space.
27
28
29% In our pipeline model, each thread is in charge of one module or one group of modules.
30% A straight forward division is to take advantage of the layer between \PS{} and \MP{}.
31% In this case, the first thread $T_1$ will read 16k of XML input $I$ at a time
32% and process all the modules in \PS{} to generates
33% content buffer, symbol array, URI array, and store them to a pre-allocated shared data structure $S$.
34% The second thread $T_2$ consumes the data provided by the first thread and
35% goes through all the modules in Markup Processor and writes output $O$.
36
37% The shared data structure is implemented using a ring buffer,
38% where each entry consists of all the arrays shared between the two threads with size of 160k.
39% In the example of Figure \ref{threads_timeline1} and \ref{threads_timeline2}, the ring buffer has four entries.
40% A lock-free mechanism is applied to ensure that each entry can only be read or written by one thread at the same time.
41% In Figure \ref{threads_timeline1}, the processing time of the first thread is longer,
42% thus the second thread always wait for the first thread to finish processing one chunk of input
43% and write to the shared memory.
44% Figure \ref{threads_timeline2} illustrates a different situation where the second thread is slower
45% and the first thread has to wait for the second thread finishing reading the shared data before it can reuse the memory space.
46
47To understand the performance improvement that can be achieved by this pipeline model,
48first suppose we have an application that spends 66\% of time on parsing when it applies Xerces XML parser
49and icXML can reduce the parsing time by half.
50According to Amdahl's law, the speedup by using icXML is 1.5x.
51When pipeline model is applied to icXML and a separate core is used for parsing,
52the entire parsing time can be hidden, thus the application is 3x faster.
53However, when parsing time does not dominate the performance,
54e.g. 40\% of the total time is used for parsing,
55we can only speedup the processing time by 1.67x.
56
57\begin{figure}
58
59\subfigure[]{
62}
63\hfill
64\subfigure[]{
66}
67\caption{}
69
70\end{figure}
71
72% \begin{figure}