source: docs/Cmpt886_Project_Report/04-evaluation.tex @ 4476

Last change on this file since 4476 was 1122, checked in by lindanl, 9 years ago

Add project report for cmpt886 on parallelizing Parabix2

File size: 6.8 KB
Line 
1\section{Evaluation}
2\subsection{Test Data and Platform}
3
4Distinguishing between ``document-oriented'' XML and ``data-oriented'' XML
5is a popular way to describe the two basic classes of XML documents.
6Data-oriented XML is used as an interchange format. Document-oriented
7XML is used to impose structure on information that rarely fits neatly
8into a relational database--particularly information intended for publishing.
9Data-oriented XML are characterized by a higher markup density.
10Markup density is defined as the ratio of the total markup contained
11within an XML file to the total XML document size.  This metric may have
12substantial influence on the performance of XML parsing.
13As such we choose workloads with distinguishable markup densities.
14Table \ref{XMLDocChars} shows the document characteristics of the XML
15instances selected for this performance study.
16
17\begin{table*}
18\begin{center}
19\begin{tabular}{|c||r|r|r|r|r|}
20\hline
21File Name               & dew.xml               & jaw.xml               & roads.gml     & po.xml        & soap.xml \\ \hline   
22File Type               & document              & document              & data          & data          & data   \\ \hline     
23File Size (kB)          & 66240                 & 7343                  & 11584         & 76450         & 2717 \\ \hline
24Markup Density          & 0.07                  & 0.13                  & 0.57          & 0.76          & 0.87  \\ \hline
25\end{tabular}
26\end{center}
27\caption{XML Document Characteristics} 
28\label{XMLDocChars} 
29\end{table*}
30
31All the experiments are run on a quad core machine with Linux kernel 2.6.35.
32Table \ref{machineinfo} gives the hardware description of the machine selected.
33\begin{table}[h]
34\begin{center}
35\begin{tabular}{|c||c|}
36\hline
37Processor & Intel Sandybridge i5-2300 (2.80GHz) \\ \hline
38L1 Cache &  4 X 32KB I-Cache 2 X 32KB D-Cache\\ \hline 
39L2 Cache &  4 X 256KB \\ \hline
40L3 Cache & 6-MB \\ \hline
41Front Side Bus &  1333 MHz\\ \hline
42Memory  &  6GB DDDR\\ \hline
43Max TDP & 95W \\ \hline
44
45\end{tabular}
46\end{center}
47\caption{Machine} 
48\label{machineinfo} 
49\end{table}
50
51\subsection{Parameters}
52\subsubsection{Segment Size}
53Increasing the segment size reduces the synchronization overhead.
54As shown in Figure \ref{para_segsize}, the processing time drops dramatically as the segment size goes from 128 bytes to 2KB.
55However, the performance cease to improve, actually slightly degrades when segment size is larger than 16KB because of cache contention.
56In real applications, we would like to use as little memory as possible without hurting much of the performance.
57The rest of the experiments are run using segment size 16KB.
58
59\begin{figure}
60\begin{center}
61\includegraphics[width=0.5\textwidth]{plots/para_segsize.pdf}
62\end{center}
63\caption{Processing Time with Different Segment Size (x axis: byte, y axis: CPU cycles per byte)}
64\label{para_segsize}
65\end{figure}
66
67\subsubsection{Circular Array Size}
68When the circular array size $C$ is smaller than the number of threads,
69only $C$ threads will be able to do useful work at the same time.
70As shown in Figure \ref{para_arrayentry}, when there are only two entries,
71the performance is even worse than the sequential Parabix.
72When the circular array size is larger than number of threads,
73increasing the size of the circular array allows threads to be pushed deeper into the queue.
74Then if one thread runs faster than the following thread,
75it has more time to process before it must stop and wait.
76To some extent, this can even out the variations of processing time.
77However, a much larger array size won't help because allocating a larger memory area can degrade the performance and
78the processing time is depending on the slowest stage in the pipeline not the fast one.
79Therefore, the rest of the experiments are run using only 6 entries.
80\begin{figure}
81\begin{center}
82\includegraphics[width=0.5\textwidth]{plots/para_arrayentry.pdf}
83\end{center}
84\caption{Processing Time with Different Circular Array Size (x axis: number of entries, y axis: CPU cycles per byte)}
85\label{para_arrayentry}
86\end{figure}
87
88
89\subsection{Load Balance}
90Figure \ref{work_balance} shows the work time and stall time of each thread with different test files.
91As discussed in the previous section, the work loads are not evenly divided.
92Therefore, the threads that process faster have to wait for predecessors to finish
93and thus incur a certain amount of stall time.
94The overhead introduced by data migration and resource contention is 27\% to 37\%,
95calculated as $(overall\_work time - sequential\_time)/sequential\_time$.
96The overhead introduced by synchronization is 15\% to 50\%, calculated as $overall\_stalltime/sequential\_time$.
97\begin{figure}
98\begin{center}
99\includegraphics[width=0.5\textwidth]{plots/work_balance.pdf}
100\end{center}
101\caption{Processing Time of Each Thread (y axis: CPU cycles per byte)}
102\label{work_balance}
103\end{figure}
104
105\subsection{Performance}
106Figure \ref{perf} demonstrates the XML well-formedness checking performance of
107the parallelized Parabix in comparison with the sequential version.
108The parallelized Parabix is more than 2 times faster on the quad core machine.
109With the sequential Parabix, the performance decreases as markup density of the test files increases.
110However, the high density files are better balanced and consume less stall time.
111Thus, it turns out that the processing time of each of the test files is about the same at 2.7 cycles per byte.
112
113\begin{figure}
114\begin{center}
115\includegraphics[width=0.5\textwidth]{plots/performance.pdf}
116\end{center}
117\caption{Processing Time (y axis: CPU cycles per byte)}
118\label{perf}
119\end{figure}
120\subsection{Power and Energy}
121Figure \ref{power} shows the average power consumed by the parallelized Parabix in comparison with the sequential version.
122By running four threads and using all the cores at the same time, the power consumption of the parallelized Parabix is much higher
123than the sequential version. However, the energy consumption is about the same, because the parallelized Parabix needs less processing time.
124In fact, as shown in Figure \ref{energy}, parsing soap.xml using parallelized Parabix consumes less energy than using sequential Parabix.
125\begin{figure}
126\begin{center}
127\includegraphics[width=0.5\textwidth]{plots/power.pdf}
128\end{center}
129\caption{Average Power (watts)}
130\label{power}
131\end{figure}
132\begin{figure}
133\begin{center}
134\includegraphics[width=0.5\textwidth]{plots/energy.pdf}
135\end{center}
136\caption{Energy Consumption (nJ per byte)}
137\label{energy}
138\end{figure}
139\subsection{Performance vs. Energy}
140Figure \ref{perf_energy} shows the performance and energy consumption of sequential and parallelized Parabix
141as well as two other XML parsers, Expat and Xerces.
142Parabix consumes 25\% of the energy of Xerces and Expat but with much better performance.
143Although the parallelized Parabix consumes slightly higher average energy than the sequential Parabix,
144it is more than 2 times faster.
145
146\begin{figure}
147\begin{center}
148\includegraphics[width=0.5\textwidth]{plots/perf_energy.pdf}
149\end{center}
150\caption{Energy vs. Performance (x axis: bytes per cycle, y axis: nJ per byte)}
151\label{perf_energy}
152\end{figure}
Note: See TracBrowser for help on using the repository browser.