%-----------------------------------------------------------------------------
%
%               Template for sigplanconf LaTeX Class
%
% Name:         sigplanconf-template.tex
%
% Purpose:      A template for sigplanconf.cls, which is a LaTeX 2e class
%               file for SIGPLAN conference proceedings.
%
% Guide:        Refer to "Author's Guide to the ACM SIGPLAN Class,"
%               sigplanconf-guide.pdf
%
% Author:       Paul C. Anagnostopoulos
%               Windfall Software
%               978 371-2316
%               paul@windfall.com
%
% Created:      15 February 2005
%
%-----------------------------------------------------------------------------


\documentclass[10pt,preprint]{sigplanconf}

% The following \documentclass options may be useful:
%
% 10pt          To set in 10-point type instead of 9-point.
% 11pt          To set in 11-point type instead of 9-point.
% authoryear    To obtain author/year citation style instead of numeric.
\usepackage{subfigure}
\usepackage{amsmath}
\usepackage{graphicx}
\usepackage{CJKutf8}
\usepackage{morefloats}
\begin{document}

\conferenceinfo{EuroSys '13}{date, City.}
\copyrightyear{2013}
\copyrightdata{[to be supplied]}

\titlebanner{banner above paper title}        % These are ignored unless
\preprintfooter{short description of paper}   % 'preprint' option specified.

\def \icXML {icXML}
\def \icXMLp {icXML-p}
\def \PS {Parabix Subsystem}
\def \MP {Markup Processor}

\title{\icXML{}:  Accelerating a Commercial XML Parser Using SIMD and Multicore Technologies}
%\subtitle{Subtitle Text, if any}
\authorinfo{Anonymous Hackers}

% \authorinfo{Nigel Medforth \and Dan Lin \and Kenneth S. Herdy \and Arrvindh Shriraman \and Robert D. Cameron }
%            {International Characters, Inc., and Simon Fraser University}
%            {\{nmedfort,lindanl,ksherdy,ashriram,cameron\}@cs.sfu.ca}

\maketitle

\begin{abstract}
\input{abstract.tex}
\end{abstract}

\category{CR-number}{subcategory}{third-level}

\terms
term1, term2

\keywords
keyword1, keyword2

\section{Introduction}

The parallelization and acceleration of XML parsing is a widely
studied problem that has produced a number
of interesting research prototypes using both SIMD and
multicore parallelism.   Most of this work has investigated
data-parallel solutions on multicore
architectures, using various strategies to break input
documents into segments that can be allocated to different cores.
For example, one approach to data
parallelization adds a pre-parsing step that computes
a skeleton tree structure of the XML document \cite{GRID2006}.
The pre-parsing stage can itself be parallelized using
state machines \cite{E-SCIENCE2007, IPDPS2008}.
Methods without pre-parsing have used speculation \cite{HPCC2011} or post-processing that
combines partial results \cite{ParaDOM2009}.
A hybrid method combining data parallelism and pipeline parallelism has also been proposed to
hide the latency of the work that must be done sequentially \cite{ICWS2008}.

Fewer efforts have investigated SIMD parallelism, although this approach
has the potential advantage of improving single-core performance as well
as reducing energy consumption.
Intel introduced specialized SIMD string processing instructions in the SSE 4.2 instruction set extension
and showed how they can be used to improve the performance of XML parsing \cite{XMLSSE42}.
The Parabix framework uses generic SIMD extensions and bit-parallel methods to
process hundreds of XML input characters simultaneously \cite{Cameron2009, cameron-EuroPar2011}.
Parabix prototypes have also combined SIMD methods with thread-level parallelism to
achieve further acceleration on multicore systems \cite{HPCA2012}.

In this paper, we move beyond research prototypes to consider
the detailed integration of both SIMD and multicore parallelism into the
Xerces-C++ parser of the Apache Software Foundation, an existing
standards-compliant open-source parser that is widely used
in commercial practice.    The challenge of this work is
to parallelize the Xerces parser in a way that
preserves the existing APIs while offering worthwhile
end-to-end acceleration of XML processing.
To achieve the best results possible, we have undertaken
a comprehensive restructuring of the Xerces-C++ parser,
seeking to expose as many critical aspects of XML parsing
as possible for parallelization.   Overall, we have
employed Parabix-style methods in transcoding, tokenization
and tag parsing, parallel string comparison methods in symbol
resolution, bit-parallel methods in namespace processing, and staged
processing with pipeline parallelism to take advantage of
multiple cores.

The remainder of this paper is organized as follows.   Section~2 discusses
the structure of the Xerces and Parabix XML parsers and the fundamental
differences between the two parsing models.   Section~3 then presents
the \icXML{} design, which restructures the Xerces architecture to
incorporate SIMD parallelism using Parabix methods.
Section~4 moves on to consider the multithreading of the \icXML{} architecture
using the pipeline parallelism model.
Section~5 analyzes the performance of both the single-threaded and
multi-threaded versions of \icXML{} in comparison to the original Xerces,
demonstrating substantial end-to-end acceleration of
a GML-to-SVG translation application written against the Xerces API.
Section~6 concludes the
paper with a discussion of future work and the potential for
applying the techniques discussed here to other application domains.

\section{Background}
\label{background}

\input{background-xerces}
\input{background-parabix}
\input{background-fundemental-differences.tex}

\section{Architecture}

\input{arch-overview.tex}

\input{arch-charactersetadapters.tex}

\input{parfilter.tex}

\input{arch-namespace.tex}

\input{arch-errorhandling.tex}

\input{multithread.tex}

\input{performance.tex}

\input{conclusion.tex}

% We recommend abbrvnat bibliography style.

\bibliographystyle{abbrvnat}

% The bibliography should be embedded for final submission.

\bibliography{reference}


\end{document}