\section{Introduction}
We have now long since reached the limits of classical Dennard voltage
scaling~\cite{}, which enabled us to keep all of the transistors
afforded by Moore's law active. This has already resulted in a rethink
of the way general-purpose processors are built: processor frequencies
have remained stagnant over the last 5 years, with Intel multicores
able to boost core speeds only when other cores on the chip are shut
off. Chip makers strive for energy-efficient computing by operating at
more optimal core frequencies and aim to increase performance with
larger numbers of cores. Unfortunately, given the limited levels of
parallelism that can be found in applications~\cite{blake-isca-2010},
it is not certain how many cores can be productively used in
scaling our chips~\cite{esmaeilzadeh-isca-2011}. This is because
exploiting parallelism across multiple cores tends to require
heavyweight threads that are difficult to manage and synchronize.

The desire to improve the overall efficiency of computing is pushing
designers to explore customized hardware~\cite{venkatesh-asplos-2010,
  hameed-isca-2010} that accelerates specific parts of an application
while reducing the overheads present in general-purpose
processors. They seek to exploit the transistor bounty to provision
many different accelerators, keeping only the accelerators needed by
an application active while switching off the others on the chip to
save power. While promising, given the fast evolution of languages
and software, it is hard to define a set of fixed-function hardware
for commodity processors. Furthermore, the toolchain to create such
customized hardware is itself a hard research challenge. We believe
that software, applications, and runtime models themselves can be
refactored to significantly improve the overall computing efficiency
of commodity processors.

In this paper, we tackle the infamous ``thirteenth dwarf'' (parsers/finite
state machines), considered the hardest application class to
parallelize~\cite{Asanovic:EECS-2006-183}, and show how Parabix,
a novel software architecture, tool chain and run-time environment,
can indeed be used to dramatically improve parsing efficiency on
commodity processors.
Based on the concept of transposing byte-oriented character data into
parallel bit streams, one for each of the individual bits of a byte,
the Parabix framework exploits the SIMD extensions (SSE/AVX on x86,
Neon on ARM) of commodity processors to process hundreds of character
positions in an input stream simultaneously. To achieve transposition,
Parabix uses SIMD instructions that pack and unpack data elements
to and from registers in a regular manner, which improves the overall
cache access behavior of the application, resulting in significantly
fewer misses and better utilization.
Parabix also dramatically reduces branches in parsing code, yielding
a more efficient pipeline, and substantially improves register/cache
utilization, which minimizes energy wasted on data transfers.
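
To make the transposition idea concrete, here is a minimal Python sketch (our own illustration; the function names are ours, and the actual Parabix runtime performs this blockwise with SIMD pack and unpack instructions rather than a scalar loop):

```python
def transpose(data):
    """Transpose byte-oriented data into 8 parallel bit streams.

    Stream k collects bit k (0 = least significant) of every input
    byte; bit i of a stream corresponds to input position i.  The
    real Parabix runtime does this blockwise with SIMD pack/unpack
    instructions; this loop only shows the semantics.
    """
    streams = [0] * 8
    for i, byte in enumerate(data):
        for k in range(8):
            if (byte >> k) & 1:
                streams[k] |= 1 << i
    return streams


def match_lt(streams, n):
    """Bit stream marking every input position that holds '<' (0x3C).

    0x3C has bits 2-5 set and bits 0, 1, 6, 7 clear, so the whole
    character class reduces to one branch-free boolean expression
    evaluated over all n positions at once.
    """
    s = streams
    mask = (1 << n) - 1
    return (~s[7] & ~s[6] & s[5] & s[4] & s[3] & s[2] & ~s[1] & ~s[0]) & mask
```

For the 11-byte input <a><b/></a>, match_lt yields a stream with bits 0, 3 and 7 set, marking all three opening angle brackets with a fixed number of logical operations per block rather than one test per byte.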

We study Parabix technology as applied to the problem of XML parsing
and develop several implementations for different computing platforms.
XML is a particularly interesting application: it is a standard of the
web consortium that provides a common framework for encoding and
communicating data. XML provides critical data storage for
applications ranging from Office Open XML in Microsoft Office to NDFD
XML of the NOAA National Weather Service, from KML in Google Earth to
Castor XML in the Mars rovers, to XML data in Android phones. XML
parsing efficiency is important for multiple application areas: in
server workloads the key focus is on overall transactions per second,
while in network switches and cell phones the latency and energy cost
of parsing are of paramount importance. Traditional software-based XML
parsers have many inefficiencies due to complex input-dependent
branching structures, which lead to considerable branch misprediction
penalties, as well as poor use of memory bandwidth and data caches due
to byte-at-a-time processing and multiple buffering. XML ASIC chips
have been around for over 6 years, but typically lag behind CPUs in
technology due to cost constraints. Our focus is on how much Parabix
technology can improve the performance of XML parsing on commodity
processors.
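
To illustrate the branching contrast, the following toy Python comparison (our own sketch, not Parabix code) pits a conventional byte-at-a-time scan, which executes one input-dependent branch per byte, against a scan over a precomputed '<' bit stream, which only does work at marked positions:

```python
def find_tags_bytewise(data):
    """Byte-at-a-time scan: one input-dependent branch per byte.

    On real hardware each unpredictable comparison risks a branch
    misprediction penalty -- the cost that bit-stream parsing avoids.
    """
    positions = []
    for i, b in enumerate(data):
        if b == 0x3C:                    # '<' -- branch depends on input data
            positions.append(i)
    return positions


def find_tags_bitstream(lt_stream):
    """Scan a precomputed '<' bit stream instead of the raw bytes.

    The loop iterates once per marked position, not once per byte,
    so ordinary characters incur no data-dependent branching at all.
    """
    positions = []
    while lt_stream:
        low = lt_stream & -lt_stream     # isolate the lowest set bit
        positions.append(low.bit_length() - 1)
        lt_stream ^= low                 # clear it and continue
    return positions
```

Both functions recover the same tag positions; the difference is that the byte-wise loop branches on every byte while the bit-stream loop's iteration count equals the number of tags found.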

In the end, as summarized in Figure~\ref{perf-energy}, our
Parabix-based XML parser improves performance by ?$\times$ and energy
efficiency by ?$\times$ compared to widely-used software parsers,
approaching the ?~cycles/input-byte performance of ASIC XML
parsers~\cite{}.\footnote{The actual energy consumption of the XML
  ASIC chips is not published by the companies.}

Overall, we make the following contributions in this paper.

1) We introduce the Parabix architecture, tool chain and run-time
environment and describe how it may be used to produce efficient
XML parser implementations on a variety of commodity processors.
While studied in the context of XML parsing, the Parabix framework
can be widely applied to many problems in text processing and
parsing.

2) We compare our Parabix XML parsers against conventional parsers
and assess the improvement in overall performance and energy efficiency
on each platform. We are the first to compare and contrast the SSE/AVX
extensions across multiple generations of Intel processors, showing
that there are performance challenges when using newer-generation SIMD
extensions, possibly due to their memory interface. We also compare the
ARM Neon extensions against the x86 SIMD extensions and comment on the
latency of SIMD operations across these architectures.

3) Finally, building on the SIMD parallelism of Parabix technology,
we multithread the Parabix XML parser to enable the different
stages in the parser to exploit SIMD units across all the cores.
This further improves performance while keeping energy consumption
on par with that of the sequential version.
\begin{comment}
Figure~\ref{perf-energy} is an energy-performance scatter plot showing
the results obtained.


With all this XML processing, a substantial literature has arisen
addressing XML processing performance in general and the performance
of XML parsers in particular.  Nicola and John specifically identified
XML parsing as a threat to database performance and outlined a number
of potential directions for performance improvements
\cite{NicolaJohn03}.  The nature of XML APIs was found to have a
significant effect on performance, with event-based SAX (Simple API for
XML) parsers avoiding the tree construction costs of the more flexible
DOM (Document Object Model) parsers \cite{Perkins05}.  The commercial
importance of XML parsing spurred development of hardware-based
approaches, including a custom XML chip
\cite{Leventhal2009} as well as FPGA-based implementations
\cite{DaiNiZhu2010}.  However promising these approaches may be for
particular niche applications, it is likely that the bulk of the
world's XML processing workload will be carried out on commodity
processors using software-based solutions.

To accelerate XML parsing performance in software, most recent
work has focused on parallelization.  The use of multicore
parallelism for chip multiprocessors has attracted
the attention of several groups \cite{ZhangPanChiu09, ParaDOM2009, LiWangLiuLi2009},
while SIMD (Single Instruction Multiple Data) parallelism
has been of interest to Intel in designing new SIMD
instructions \cite{XMLSSE42}, as well as to the developers of parallel
bit stream technology
\cite{CameronHerdyLin2008,Cameron2009,Cameron2010}.
Each of these approaches has shown considerable performance
benefits over traditional sequential parsing techniques that follow the
byte-at-a-time model.
\end{comment}
\begin{figure}
\begin{center}
\includegraphics[width=85mm]{plots/performance_energy_chart.pdf}
\end{center}
\caption{XML Parser Technology Energy vs. Performance}
\label{perf-energy}
\end{figure}

The remainder of this paper is organized as follows.
Section~\ref{section:background} presents background material on XML parsing
and provides insight into the inefficiency of traditional parsers on
mainstream processors.  Section~\ref{section:parabix} describes the
Parabix architecture, tool chain and run-time environment.
Section~\ref{section:parser} describes the application of the
Parabix framework to the construction of an XML parser
enforcing all the well-formedness rules of the XML
specification.  Section~\ref{section:methodology} then describes
the overall methodology of our performance and energy study.
Section~\ref{section:baseline} presents a detailed
performance evaluation on a \CITHREE\ processor as
our primary evaluation platform, addressing a number of
microarchitectural issues including cache misses, branch
mispredictions, and SIMD instruction counts.  Section~\ref{section:scalability} examines
scalability and performance gains through three generations of Intel
architecture.  Section~\ref{section:avx} examines the extension
of the Parabix technology to take advantage of Intel's new
256-bit AVX technology, while Section~\ref{section:neon} investigates
the application of this technology on mobile platforms using
ARM processors with Neon SIMD extensions.
Section~\ref{section:multithread} then looks at multithreading the
Parabix XML parser using pipeline parallelism.
Section~\ref{section:conclusion} concludes the paper.
%One area in which both servers and mobile devices devote considerable
%computational effort is the processing of Extensible Markup
%Language (XML) documents.  It was predicted that corporate servers
%would see a ``growth in XML traffic\ldots from 15\% [of overall
%network traffic] in 2004 to just under 48\% by 2008''
%\cite{coyle2005}.  Further, ``from the point of view of server
%efficiency[,] XML\ldots is the closest thing there is to a ubiquitous
%computing workload'' \cite{leventhal2009}.  In other words, XML is
%quickly becoming the backbone of most server/server and client/server
%information exchanges.  Similarly, there is growing interest in the
%use of mobile web services for personalization, context-awareness, and
%content-adaptation of mobile web sites---most of which rely on XML
%\cite{canali2009}.  Whether the end user realizes it or not, XML is
%part of their daily life.

%Why are XML parsers important ?
%Talk about XML parsers and what they do in general.
%Brief few lines about byte-at-a-time ?
%What's new with the Parabix style approach ?
%Introduce Parabix1 and Parabix2 ?
%Present overall quantitative improvements compared to other parsers.