\section{Introduction}
We have now long since reached the limit of classical Dennard voltage
scaling~\cite{}, which enabled us to keep all of the transistors
afforded by Moore's law active. This has already prompted a rethink of
the way general-purpose processors are built: processor frequencies
have remained stagnant over the last five years, with Intel multicores
able to boost core speeds only when other cores on the chip are shut
off. Chip makers strive for energy-efficient computing by operating at
more optimal core frequencies and aim to increase performance with a
larger number of cores. Unfortunately, given the limited levels of
parallelism that can be found in applications~\cite{blake-isca-2010},
it is not certain how many cores can be productively used in
scaling our chips~\cite{esmaeilzadeh-isca-2011}. This is because
exploiting parallelism across multiple cores tends to require
heavyweight threads that are difficult to manage and synchronize.

The desire to improve the overall efficiency of computing is pushing
designers to explore customized hardware~\cite{venkatesh-asplos-2010,
  hameed-isca-2010} that accelerates specific parts of an application
while reducing the overheads present in general-purpose
processors. They seek to exploit the transistor bounty to provision
many different accelerators, keeping only the accelerators needed by
an application active while switching off the others on the chip to
save power. While promising, given the fast evolution of languages
and software, it is hard to define a set of fixed-function hardware
for commodity processors. Furthermore, the toolchain to create such
customized hardware is itself a hard research challenge. We believe
that software, applications, and runtime models themselves can be
refactored to significantly improve the overall computing efficiency
of commodity processors.

In this paper, we tackle the infamous ``thirteenth dwarf'' (parsers/finite
state machines), considered the hardest application class to
parallelize~\cite{Asanovic:EECS-2006-183}, and show how Parabix,
a novel software architecture, tool chain, and run-time environment,
can indeed be used to dramatically improve parsing efficiency on
commodity processors.
Based on the concept of transposing byte-oriented character data into
parallel bit streams for the individual bits of each byte, the Parabix
framework exploits the SIMD extensions of commodity processors
(SSE/AVX on x86, Neon on ARM) to process hundreds of character
positions in an input stream simultaneously.  To achieve transposition,
Parabix exploits sophisticated SIMD instructions that pack and unpack
data elements in registers in a regular manner, which improves the
overall cache access behavior of the application, resulting in
significantly fewer misses and better utilization.  Parabix also
dramatically reduces branches in parsing code, resulting in a more
efficient pipeline, and substantially improves register/cache
utilization, minimizing the energy wasted on data transfers.

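To illustrate the transposition idea, the sketch below (our own, not
code from the Parabix distribution) serially transposes a 16-byte
block into eight 16-bit parallel bit streams with SSE2 intrinsics; the
function name and bit ordering are assumptions for illustration, and
Parabix itself uses faster pack/unpack-based transposition than this
movemask loop.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

/* Illustrative sketch only: transpose 16 input bytes into eight 16-bit
 * parallel bit streams, one stream per bit position.  streams[k] holds
 * bit (7-k) of each byte, so streams[0] is the stream of high bits. */
static void transpose_16(const uint8_t *input, uint16_t streams[8])
{
    __m128i v = _mm_loadu_si128((const __m128i *)input);
    for (int k = 0; k < 8; k++) {
        /* gather the most significant bit of each of the 16 bytes */
        streams[k] = (uint16_t)_mm_movemask_epi8(v);
        /* double each byte (v + v): shift the next bit into the MSB
         * without crossing byte boundaries */
        v = _mm_add_epi8(v, v);
    }
}
```

A production transposition processes a full SIMD block per stream with
pack/unpack steps rather than one movemask per bit, but the data layout
produced is the same.
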
We study Parabix technology in application to the problem of XML parsing
and develop several implementations for different computing platforms.

XML is a particularly interesting application; it is a standard of the
web consortium that provides a common framework for encoding and
communicating data.  XML provides critical data storage for
applications ranging from Office Open XML in Microsoft Office to NDFD
XML of the NOAA National Weather Service, from KML in Google Earth to
Castor XML in the Martian Rovers, as well as XML data in Android
phones.  XML parsing efficiency is important for multiple application
areas: in server workloads the key focus is on overall transactions
per second, while in network switches and cell phones the latency and
energy cost of parsing are of paramount importance.  Software-based
XML parsers are particularly inefficient; they consist of giant
\textit{switch-case} statements that waste processor resources by
introducing input-data-dependent branches.  They also have poor cache
efficiency, since they sift forward and backward through the
input-data stream trying to match the parsed tags.  XML ASIC chips
have been around for over six years, but typically lag behind CPUs in
technology due to cost constraints.  Our focus is on how much we can
improve the performance of XML parsing on commodity processors with
Parabix technology.

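To make the branching problem concrete, the toy scanner below (purely
illustrative; not drawn from any real parser) shows the byte-at-a-time
\textit{switch-case} style: every input byte funnels through a
multiway branch whose direction depends on the data, which is exactly
what the bit-stream formulation eliminates.

```c
#include <stddef.h>

/* Illustrative only: a byte-at-a-time scanner dispatches on every
 * input character, so each byte costs at least one input-data-dependent
 * branch that the predictor cannot learn. */
typedef enum { OUTSIDE, IN_TAG } State;

static size_t count_tags(const char *buf, size_t len)
{
    State state = OUTSIDE;
    size_t tags = 0;
    for (size_t i = 0; i < len; i++) {
        switch (buf[i]) {          /* branch target depends on the data */
        case '<':
            state = IN_TAG;
            tags++;
            break;
        case '>':
            state = OUTSIDE;
            break;
        default:
            break;                 /* every other byte still pays the dispatch */
        }
    }
    (void)state;
    return tags;
}
```

A real XML parser has far more states and cases, multiplying the
number of hard-to-predict branch outcomes per input byte.
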
In the end, as summarized by
Figure~\ref{perf-energy}, our Parabix-based XML parser improves
performance by ?$\times$ and energy efficiency by ?$\times$ compared
to widely-used software parsers, approaching the ?~cycles/input-byte
performance of ASIC XML
parsers~\cite{}.\footnote{The actual energy consumption of the XML
  ASIC chips is not published by the companies.}

Overall, we make the following contributions in this paper.

1) We introduce the Parabix architecture, tool chain, and run-time
environment and describe how it may be used to produce efficient
XML parser implementations on a variety of commodity processors.
While studied in the context of XML parsing, the Parabix framework
can be widely applied to many problems in text processing and
parsing.

2) We compare our Parabix XML parsers against conventional parsers
and assess the improvement in overall performance and energy
efficiency on each platform.  We are the first to compare and contrast
the SSE/AVX extensions across multiple generations of Intel processors
and show that there are performance challenges when using
newer-generation SIMD extensions, possibly due to their memory
interface.  We also compare the ARM Neon extensions against the x86
SIMD extensions and comment on the latency of SIMD operations across
these architectures.

3) Finally, building on the SIMD parallelism of Parabix technology,
we multithread the Parabix XML parser to enable the different
stages in the parser to exploit SIMD units across all the cores.
This further improves performance while keeping energy consumption
on par with that of the sequential version.

\begin{comment}
Figure~\ref{perf-energy} is an energy-performance scatter plot showing
the results obtained.


With all this XML processing, a substantial literature has arisen
addressing XML processing performance in general and the performance
of XML parsers in particular.  Nicola and John specifically identified
XML parsing as a threat to database performance and outlined a number
of potential directions for performance improvements
\cite{NicolaJohn03}.  The nature of XML APIs was found to have a
significant effect on performance, with event-based SAX (Simple API for
XML) parsers avoiding the tree construction costs of the more flexible
DOM (Document Object Model) parsers \cite{Perkins05}.  The commercial
importance of XML parsing spurred development of hardware-based
approaches including a custom XML chip
\cite{Leventhal2009} as well as FPGA-based implementations
\cite{DaiNiZhu2010}.  However promising these approaches may be for
particular niche applications, it is likely that the bulk of the
world's XML processing workload will be carried out on commodity
processors using software-based solutions.

To accelerate XML parsing performance in software, most recent
work has focused on parallelization.  The use of multicore
parallelism for chip multiprocessors has attracted
the attention of several groups \cite{ZhangPanChiu09, ParaDOM2009, LiWangLiuLi2009},
while SIMD (Single Instruction Multiple Data) parallelism
has been of interest to Intel in designing new SIMD instructions \cite{XMLSSE42},
as well as to the developers of parallel bit stream technology
\cite{CameronHerdyLin2008,Cameron2009,Cameron2010}.
Each of these approaches has shown considerable performance
benefits over traditional sequential parsing techniques that follow the
byte-at-a-time model.
\end{comment}

\begin{figure}
\begin{center}
\includegraphics[width=85mm]{plots/performance_energy_chart.pdf}
\end{center}
\caption{XML Parser Technology Energy vs. Performance}
\label{perf-energy}
\end{figure}

The remainder of this paper is organized as follows.
Section~\ref{background} presents background material on XML parsing
and provides insight into the inefficiency of traditional parsers on
mainstream processors.  Section~\ref{parallel-bitstream} reviews
parallel bit stream technology, a framework for exploiting the
sophisticated data-parallel SIMD extensions of modern processors.
Section 5 presents a detailed performance evaluation on a \CITHREE\
processor as our primary evaluation platform, addressing a number of
microarchitectural issues including cache misses, branch
mispredictions, and SIMD instruction counts.  Section 6 examines
scalability and performance gains through three generations of Intel
architecture, culminating with a performance assessment on our
two-week-old \SB\ test machine.  We look specifically at issues in
applying the new 256-bit AVX technology to parallel bit stream
technology and note that the major performance benefit seen so far
results from the change to the non-destructive three-operand
instruction format.



%One area in which both servers and mobile devices devote considerable
%computational effort into is in the processing of Extensible Markup
%Language (XML) documents.  It was predicted that corporate servers
%would see a ``growth in XML traffic\ldots from 15\% [of overall
%network traffic] in 2004 to just under 48\% by 2008''
%\cite{coyle2005}.  Further, ``from the point of view of server
%efficiency[,] XML\ldots is the closest thing there is to a ubiquitous
%computing workload'' \cite{leventhal2009}.  In other words, XML is
%quickly becoming the backbone of most server/server and client/server
%information exchanges.  Similarly, there is growing interest in the
%use of mobile web services for personalization, context-awareness, and
%content-adaptation of mobile web sites---most of which rely on XML
%\cite{canali2009}.  Whether the end user realizes it or not, XML is
%part of their daily life.

%Why are XML parsers important ?
%Talk about XML parsers and what they do in general.
%Brief few lines about byte-at-time ?
%What's new with Parabix style approach ?
%Introduce Parabix1 and Parabix2 ?
%Present overall quantitative improvements compared to other parsers.