\section{Leveraging SIMD Parallelism for Multicore: Pipeline Parallelism}
\subsection{Pipeline Strategy for \icXML{}}
As discussed in section \ref{background:xerces}, Xerces can be considered a complex finite-state machine,
the hardest type of application to parallelize and process efficiently \cite{Asanovic:EECS-2006-183}.
However, \icXML{} restructures Xerces into logical layers between modules,
which naturally enable pipeline-parallel processing.

In our pipeline model, each thread is in charge of one module or one group of modules.
A straightforward division takes advantage of the layer between the Parabix Subsystem and the Markup Processor.
In this case, the first thread $T_1$ reads 16k of XML input $I$ at a time,
runs it through all the modules of the Parabix Subsystem to generate the
content buffer, symbol array, and URI array, and stores them in a pre-allocated shared data structure $S$.
The second thread $T_2$ consumes the data produced by the first thread,
runs it through all the modules of the Markup Processor, and writes the output $O$.
22
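The two-stage division above can be sketched as follows. This is a simplified illustration, not icXML's actual code: the stage functions, chunk type, and the mutex/condition-variable queue standing in for $S$ are all placeholders (icXML itself uses the lock-free ring buffer described below), and the "processing" is a trivial uppercase transform.

```cpp
#include <cassert>
#include <cctype>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

// One unit of work handed from T1 to T2; `last` marks end of input.
struct Chunk { std::string content; bool last = false; };

// Simple thread-safe queue playing the role of the shared structure S.
class ChunkQueue {
    std::queue<Chunk> q_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void push(Chunk c) {
        { std::lock_guard<std::mutex> g(m_); q_.push(std::move(c)); }
        cv_.notify_one();
    }
    Chunk pop() {
        std::unique_lock<std::mutex> g(m_);
        cv_.wait(g, [this]{ return !q_.empty(); });
        Chunk c = std::move(q_.front()); q_.pop();
        return c;
    }
};

// T1 splits the input I into fixed-size chunks and "processes" each one
// (uppercasing stands in for the Parabix Subsystem); T2 drains the queue
// and appends to the output O (standing in for the Markup Processor).
std::string run_pipeline(const std::string& input, size_t chunk_size) {
    ChunkQueue shared;
    std::string output;

    std::thread t1([&]{
        for (size_t i = 0; i < input.size(); i += chunk_size) {
            Chunk c{input.substr(i, chunk_size), false};
            for (char& ch : c.content)
                ch = static_cast<char>(std::toupper(static_cast<unsigned char>(ch)));
            shared.push(std::move(c));
        }
        shared.push(Chunk{"", true});  // sentinel: no more input
    });
    std::thread t2([&]{
        for (;;) {
            Chunk c = shared.pop();
            if (c.last) break;
            output += c.content;
        }
    });
    t1.join();
    t2.join();
    return output;
}
```

Because only $T_1$ produces and only $T_2$ consumes, no chunk is ever touched by both threads at once, which is the property the lock-free ring buffer preserves without a mutex.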
The shared data structure is implemented as a ring buffer,
where each entry holds all the arrays shared between the two threads and occupies 160k.
In the examples of Figures \ref{threads_timeline1} and \ref{threads_timeline2}, the ring buffer has four entries.
A lock-free mechanism ensures that each entry can be read or written by only one thread at a time.
In Figure \ref{threads_timeline1}, the processing time of the first thread is longer,
so the second thread always waits for the first thread to finish processing a chunk of input
and write it to the shared memory.
Figure \ref{threads_timeline2} illustrates the opposite situation, where the second thread is slower
and the first thread must wait for the second thread to finish reading the shared data before it can reuse that memory space.

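A minimal sketch of such a lock-free mechanism is a single-producer/single-consumer ring buffer guarded only by two atomic counters: the producer blocks (by failing to push) when all entries are full, and the consumer blocks (by failing to pop) when all entries are empty, matching the two waiting scenarios in the figures. The entry type and the four-entry capacity are illustrative; the real entries hold the shared arrays described above.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

// Single-producer/single-consumer lock-free ring buffer.
// head_ and tail_ increase monotonically; the entry index is taken mod N.
template <typename T, size_t N>
class SpscRing {
    std::array<T, N> buf_;
    std::atomic<size_t> head_{0};  // next entry the consumer (T2) will read
    std::atomic<size_t> tail_{0};  // next entry the producer (T1) will write
public:
    // Producer side: fails when the ring is full, i.e. T2 has not yet
    // finished reading the oldest entry (the situation in the second figure).
    bool try_push(const T& v) {
        size_t t = tail_.load(std::memory_order_relaxed);
        if (t - head_.load(std::memory_order_acquire) == N) return false;
        buf_[t % N] = v;
        tail_.store(t + 1, std::memory_order_release);  // publish the entry
        return true;
    }
    // Consumer side: fails when the ring is empty, i.e. T1 has not yet
    // produced the next entry (the situation in the first figure).
    bool try_pop(T& out) {
        size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire)) return false;
        out = buf_[h % N];
        head_.store(h + 1, std::memory_order_release);  // free the entry
        return true;
    }
};
```

Since each index is written by exactly one thread, a release store paired with an acquire load is enough to hand an entry from one thread to the other; no locks or compare-and-swap loops are needed.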
\begin{figure}
\caption{Processing timeline when the first thread is slower}
\label{threads_timeline1}
\end{figure}
\clearpage

\begin{figure}
\caption{Processing timeline when the second thread is slower}
\label{threads_timeline2}
\end{figure}
\clearpage

\subsection{Performance Comparison}


\begin{figure}
\begin{center}
\end{center}

\end{figure}
\begin{figure}
\end{figure}

\begin{figure}
\begin{center}