% source: docs/Cmpt886_Project_Report/03-strategy.tex
% Last changed in r1122 by lindanl: add project report for cmpt886 on parallelizing Parabix2
\section{Parallelizing Strategy}
\subsection{Pipelining Parallelism}
The typical approach to parallelizing software, data parallelism,
requires nearly independent data, but XML data is difficult
to divide in this way. A simple division determined by the
segment size can easily make most of the segments illegal
according to the parsing rules, even though the data as a whole is legal.
For example, Figure \ref{fig:sample_xml} is a legal XML file,
but Figure \ref{fig:seg_xml}, which contains an incomplete start tag,
is not a legal document on its own.

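Any conforming XML parser must reject such a fragment outright. A minimal check (an illustrative Python sketch, not part of Parabix, using the fragment from Figure \ref{fig:seg_xml}):

```python
import xml.etree.ElementTree as ET

# The segment from Figure "XML Document Segment": it begins in the
# middle of a <shipTo ...> start tag, so it is not well-formed on its own.
fragment = '''          country="Canada">
    <name> Alice </name>
    <address> XXX </address>
  </shipTo>'''

def is_well_formed(text):
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed(fragment))  # False: the segment alone is illegal
```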
\begin{figure}[h]
{
\scriptsize
\begin{verbatim}
          country="Canada">
    <name> Alice </name>
    <address> XXX </address>
  </shipTo>
\end{verbatim}
}
\caption{XML Document Segment}
\label{fig:seg_xml}
\end{figure}

Therefore, instead of dividing the data into segments and
assigning different data segments to different cores,
we divide the process into several stages and let each core work on a single stage.
As shown in Figure \ref{dependency}, data dependencies exist
between the stages that process the same segment.
Take the first segment S1 as an example: S1-B depends on S1-A, S1-C depends on S1-B, and S1-D depends on S1-C.
There are also dependencies between different segments processed in the same stage.
For example, S2-A depends on S1-A, S3-A depends on S2-A, and so forth.
However, there is no dependency between different segments processed in different stages, such as S1-A and S2-B.
Thus, the pipeline strategy shown in Figure \ref{pipeline}
ensures that at any time slice there is no dependency between the threads.

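The resulting schedule can be enumerated directly (an illustrative Python sketch, not part of Parabix): at time slice $t$, thread $k{+}1$ works on segment $t-k$, so no two threads ever touch the same segment-stage pair.

```python
# Enumerate the pipeline schedule for 4 stages (A-D) and 6 segments:
# at time slice t, thread k+1 processes segment t-k in stage k.
def schedule(num_stages=4, num_segments=6):
    slices = []
    for t in range(num_segments + num_stages - 1):
        active = {"T%d" % (k + 1): "S%d-%s" % (t - k + 1, "ABCD"[k])
                  for k in range(num_stages) if 0 <= t - k < num_segments}
        slices.append(active)
    return slices

for t, work in enumerate(schedule()):
    print(t, work)
# e.g. slice 3 runs T1:S4-A, T2:S3-B, T3:S2-C, T4:S1-D concurrently
```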
\begin{figure}
\begin{center}
\includegraphics[width=0.5\textwidth]{plots/dependency.pdf}
\end{center}
\caption{Data Dependency}
\label{dependency}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=0.5\textwidth]{plots/pipeline.pdf}
\end{center}
\caption{Pipelining Strategy}
\label{pipeline}
\end{figure}

\subsection{Work Division}
Ideally, no thread will stall when all of them process a given segment at the same speed;
for example, when T1 finishes S6, T2 should be finishing S5, T3 should be finishing S4, and T4 should be finishing S3.
However, it is difficult to divide the work evenly. As shown in Figure \ref{work_distribution},
Parabix can be separated into eleven passes, and the time consumed by each pass differs depending on the characteristics of the test file.

On a quad-core machine, we divide the eleven passes into four stages based on
the work distribution obtained by analyzing the sequential Parabix (Figure \ref{work_distribution}).
StageA contains passes one through three: fill\_buffer, s2p and classify\_bytes.
StageB contains passes four through seven: validate\_u8, gen\_scope, parse\_CtCDPI and parse\_ref.
StageC contains passes eight through ten: parse\_tag, validate\_name and gen\_check.
StageD contains the last pass, postprocessing.
Table \ref{stage} shows the time consumed by each stage.
Overhead is introduced when each stage runs on a different core,
due to resource contention and data migration, which will be discussed in the next section.
Therefore, the processing time of each stage on a separate thread may be higher than shown in the table.

\begin{table}[h]
\begin{center}
\begin{tabular}{|c|c|c|c|c|c|}
\hline
       & dew    & jaw     & roads     & po     & soap\\ \hline
stageA & 1.981  & 1.973   & 1.967     & 1.97   & 1.97\\ \hline
stageB & 1.161  & 1.403   & 0.895     & 1.724  & 0.877\\ \hline
stageC & 1.582  & 1.601   & 2.383     & 2.141  & 2.451\\ \hline
stageD & 0.855  & 0.928   & 1.28      & 1.704  & 1.818\\ \hline
\end{tabular}
\end{center}
\caption{Time Consumed on Each Stage under Sequential Processing (CPU cycles per byte)}
\label{stage}
\end{table}

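Since a pipeline advances at the rate of its slowest stage, these figures also bound the attainable speedup. A small sanity check using the dew column of Table \ref{stage}:

```python
# Cycles per byte for file "dew", copied from the stage-timing table.
stages = {"A": 1.981, "B": 1.161, "C": 1.582, "D": 0.855}

sequential = sum(stages.values())   # total sequential cost per byte
bottleneck = max(stages.values())   # pipeline throughput is set by StageA here
ideal_speedup = sequential / bottleneck

print(round(sequential, 3))     # 5.579
print(round(ideal_speedup, 2))  # 2.82 -- less than 4x because of imbalance
```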
\begin{figure}
\begin{center}
\includegraphics[width=0.5\textwidth]{plots/work_distribution.pdf}
\end{center}
\caption{Work Distribution Between Each Pass}
\label{work_distribution}
\end{figure}

\subsection{Circular Queue}

\begin{table*}[t]
\begin{center}
\begin{tabular}{|c|c|c|c|c|c|c|c|c|c|c|c|}
\hline
& & \multicolumn{10}{|c|}{Data Structures}\\ \hline
&                & srcbuf & basis\_bits & u8   & lex   & scope & ctCDPI & ref    & tag    & xml\_names & check\_streams\\ \hline
\multirow{3}{*}{StageA}
&fill\_buffer    & write  &             &      &       &       &        &        &        &            &               \\
&s2p             & read   & write       &      &       &       &        &        &        &            &               \\
&classify\_bytes &        & read        &      & write &       &        &        &        &            &               \\ \hline
\multirow{4}{*}{StageB}
&validate\_u8    &        & read        & write&       &       &        &        &        &            &               \\
&gen\_scope      &        &             &      & read  & write &        &        &        &            &               \\
&parse\_CtCDPI   &        &             &      & read  & read  & write  &        &        &            & write         \\
&parse\_ref      &        &             &      & read  & read  & read   & write  &        &            &               \\ \hline
\multirow{3}{*}{StageC}
&parse\_tag      &        &             &      & read  & read  & read   &        & write  &            &               \\
&validate\_name  &        &             & read & read  &       & read   & read   & read   & write      & write         \\
&gen\_check      &        &             & read & read  & read  & read   &        & read   & read       & write         \\ \hline
StageD
&postprocessing  & read   &             &      & read  &       & read   & read   &        &            & read          \\ \hline
\end{tabular}
\end{center}
\caption{Relationship between Each Pass and Data Structures}
\label{pass_structure}
\end{table*}

The interface between stages is implemented as a circular array,
where each entry consists of all ten data structures for one segment, as listed in Table \ref{pass_structure}.
Each thread keeps an index into the array ($I_N$),
which is compared with the index ($I_{N-1}$) kept by the previous thread before processing a segment.
If $I_N$ is smaller than $I_{N-1}$, thread N can start processing segment $I_N$;
otherwise the thread keeps reading $I_{N-1}$ until $I_{N-1}$ becomes larger than $I_N$.
The time consumed by continuously loading the value of $I_{N-1}$ and
comparing it with $I_N$ will later be referred to as stall time.
When a thread finishes processing a segment, it increments its index by one.

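The index protocol can be sketched as follows (a Python model with busy-waiting threads; the names are illustrative, and buffer wraparound is omitted for brevity):

```python
import threading

NUM_SEGMENTS = 8
NUM_STAGES = 4
# index[N] plays the role of I_N: the number of segments stage N has finished.
index = [0] * NUM_STAGES
log = []                       # (stage, segment) pairs in completion order
log_lock = threading.Lock()

def stage(n):
    for seg in range(NUM_SEGMENTS):
        # Stall time: spin until the previous stage has moved past this segment.
        while n > 0 and index[n] >= index[n - 1]:
            pass
        with log_lock:
            log.append((n, seg))   # stand-in for processing segment seg
        index[n] += 1              # publish progress to the next stage

threads = [threading.Thread(target=stage, args=(n,)) for n in range(NUM_STAGES)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Replaying `log` confirms that every segment passes through the stages in order, with no stage overtaking its predecessor.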
This algorithm has three basic sources of overhead. The first is data migration.
As shown in Table \ref{pass_structure}, many of the data structures
are needed by more than one stage and hence processed by more than one core.
Therefore, those data structures have to be loaded several times into different caches.
The second overhead comes from maintaining cache coherency for those data structures.
Fortunately, as Table \ref{pass_structure} shows, all of the data structures except check\_streams
are written only once, when they are first brought into the cache.
Further optimization could be achieved by compressing some of the data structures
as well as restructuring check\_streams so that it is written only by the gen\_check pass.
The third overhead is caused by the control data (the indices).
Since $I_N$ is shared between thread N and thread N+1,
each time thread N modifies $I_N$ it invalidates the corresponding cache line,
and the next read by thread N+1 incurs a cache miss.