source: docs/Cmpt886_Project_Report/02-background.tex @ 3508

Last change on this file since 3508 was 1122, checked in by lindanl, 9 years ago

Add project report for cmpt886 on parallelizing Parabix2

File size: 3.9 KB
Line 
1\section{Background}
2\label{section:background}
3\subsection{XML}
4Extensible Markup Language (XML) is a simple text format derived from SGML,
5officially adopted by W3C as a standard in 1998 \cite{TR:XML}.
6It is essentially a set of rules for encoding documents or data in a human readable form.
7For example, a typical XML file could be:
8\begin{figure}[h]
9{
10\scriptsize
11\begin{verbatim}
12<?xml version="1.0"?>
13<shipping>
14  <shipTo  country="Canada">
15    <name> Alice </name>
16    <address> XXX </address>
17  </shipTo>
18  ...
19</shipping>
20\end{verbatim}
21}
22\caption{Simple XML Document}
23\label{fig:sample_xml}
24\end{figure}
25
26\subsection{XML Well-Formedness Checking}
27The XML specification defines an XML document as a text which is well-formed.
28That is, it satisfies a list of syntax rules provided in the specification.
29If a document are correctly formed and conform to all the well-formedness rules, then it is considered as well formed.
30
31XML well-formedness checking application is evaluated for each XML parsing technology.
32The decision to perform XML well-formedness checking for this study is based on the following rational.
33First, each XML parser must provide well-formedness checking functionality.
34Secondly, this functionality meets the minimum requirement of an XML document being readable
35by computers and avoids any additional costs due to non-parsing related computation.
36
37\subsection{Traditional XML Parsers}
38Traditional XML parsers such as Expat and Xerces process XML one byte at a time.
39They parse a source document serially, from the first to the last byte of the source file.
40Each character of the source text is examined in turn to distinguish between the XML-specific markup,
41such as an opening angle bracket `$<$', and the content held between markups.
42This method creates a lot of branches in the program.
43In fact, Xerces can execute as many as 13 branches per byte it processed.
44
45Both Expat and Xerces-C are C/C++ based and open-source.
46Expat was originally released in 1998; it is currently used in Mozilla Firefox and Open Office \cite{expat}.
47Xerces-C was released in 1999 and is the foundation of the Apache XML project \cite{xerces}.
48
49\subsection{Parabix: SIMD-based XML Parser}
50Parabix is a SIMD-based parallel bitstream XML parser.
51It first transposes the input byte stream into eight parallel bitstreams called basis bitstreams.
52Then it calculates all the markup item streams based on those basis bitstreams using SIMD bitwise logic instructions.
53A markup item streams is simply a sequence of 0s and 1s, where there is
54one such bit in the stream for each markup item in a source data stream.
55As shown in Figure \ref{fig:ParabixExample}, $M_0$ is left angle stream shifted by 1.
56Given all the markup item streams, parsing becomes a series of parallel bit scans.
57For example, in Figure \ref{fig:ParabixExample}, the first step of parsing the start tag
58is to scan through the tag names using $M_0$. On a little endian byte order machine,
59parallel bit scan is implemented using addition and logic operations \cite{Cameron2010}.
60$M_1$ gives the result of scanning.
61
62\begin{figure}[h]
63\begin{center}
64\begin{tabular}{lr}\\
65source data                     & \verb`<t1>abc</t1><tag2/>`\\
66$Name = $ [0-9a-z]              & \verb`.11.111..11..1111..`\\
67$M_0 = \texttt{[<]}>>1$         & \verb`.1......1....1.....`\\
68$M_1 = scanThru(M_0, Name)$     & \verb`...1....1........1.`
69\end{tabular}
70\end{center}
71\caption{Partial Start Tag Parsing}
72\label{fig:ParabixExample}
73\end{figure}
74
75
76Figure \ref{perf_bg} shows the results of the overall performance for XML well-formedness checking
77evaluated as CPU cycles per input bytes. Parabix is 2 to 7 times faster than traditional parsers.
78\begin{figure}
79\begin{center}
80\includegraphics[width=0.5\textwidth]{plots/performance_bg.pdf}
81\end{center}
82\caption{Processing Time (y-axis: CPU cycles per byte)}
83\label{perf_bg}
84\end{figure}
85
86
87Parabix exploits the data parallelism via SIMD instructions.
88The goal of this study is to further improve the performance using chip-multiprocessors.
Note: See TracBrowser for help on using the repository browser.