source: docs/Cmpt886_Project_Report/01-intro.tex @ 5778

Last change on this file since 5778 was 1122, checked in by lindanl, 8 years ago

Add project report for cmpt886 on parallelizing Parabix2

File size: 3.4 KB
Line 
1\section{Introduction}
2
3Extensible Markup Language(XML) is widely used for web applications and services.
4Parsing of XML may be a significant factor affecting the response time and even the throughput.
5XML also plays an important role in databases. Enterprises keep large amounts of business critical data permanently in XML format.
6However, those databases may not provide the same high performance characteristics as relational data processing
7because processing of XML requires parsing of XML documents which can be very CPU-intensive \cite{threat}.
8
9CPU-intensive applications are able to benefit from multicore systems by dividing the work to run on different cores.
10The growing prevalence of multicore systems motivates the investigation of parallelizing XML parsers.
11However, parallelizing programs is challenging because the gains from parallel execution can be overshadowed by the cost of communication and synchronization.
12Parallelizing XML parsing software is further complicated because it is sequential in nature.
13
14There are three basic parallelizing methods. Task parallelism refers to parallel executions where the output of each task never reaches the input of the other.
15Task parallelizing XML parsers can be easily achieved by processing different files on different cores at the same time.
16Unfortunately, the files used with either web services or databases are often large requiring fast parsing on a single file.
17
18Data parallelism refers to the partitioning of data into several chunks which are each processed in parallel.
19Data parallelism is well-suited to multicore systems when there are no dependencies between each partition.
20Otherwise, it can introduce excessive communication overheads and potential hazards.
21The nature of XML files makes them hard to partition nicely for data parallelism.
22Several approaches have been used to address this problem.
23A preparsing phase has been proposed to help partition the XML document \cite{dataparallel}.
24The goal of this preparsing is to determine the tree structure of the XML document so that it can be used to guide the full parsing in the next phase.
25Other methods such as overlapping and speculation have also been used for data parallel processing \cite{browser}.
26
27Pipeline parallelism is appropriate when the computation can be divided into multiple stages such that %the input of each stage is the output from the previous stages.
28data naturally flows from one stage to another and synchronization is needed only at stage interface.
29Compared to data parallelism, it offers reduced latency, reduced buffering and less communication \cite{streamit}.
30But it is often hard to implement well, due to problems of synchronization and balancing the load between stages.
31However, previous work presented a high performance XML parser, Parabix, using parallel bitstream techniques \cite{CameronHerdyLin2008, Cameron2010}.
32Its unique structure makes it easy to separate the parsing into several passes.
33This multipass structure makes it suitable to apply pipeline parallelism on multicore systems.
34
35This paper presents a pipelining strategy for Parabix based on the observation of data dependencies and analysis of the characteristics of each pass.
36The resulting performance and energy consumption is compared with sequential Parabix and other sequential XML parsers that have been widely used.
37The generalization of this pipelining strategy to other applications based on parallel bitstream techniques is described.
38
39
40
Note: See TracBrowser for help on using the repository browser.