source: docs/HPCA2012/01-intro.tex @ 1328

Last change on this file since 1328 was 1326, checked in by ashriram, 8 years ago

New Intro New title

Classical Dennard scaling~\cite{}, which ensured that voltage scaling
would let us keep all of the transistors afforded by Moore's law
active, has come to an end. This has already forced a rethink of the
way general-purpose processors are built: processor frequencies have
remained stagnant over the last 5 years, and Intel multicores provide
the capability to boost core speeds if other cores on the chip are
shut off. Chip makers strive to achieve energy-efficient computing by
operating at more optimal core frequencies and aiming to increase
performance with a larger number of cores. Unfortunately, given the
levels of parallelism that multicores can exploit in
applications~\cite{blake-isca-2010}, it is not certain how many cores
we can continue to scale our chips to~\cite{esmaeilzadeh-isca-2011}.
This is because exploiting parallelism across multiple cores tends to
require heavyweight threads that are difficult to manage and
synchronize.

The desire to improve the overall efficiency of computing is pushing
designers to explore customized hardware~\cite{venkatesh-asplos-2010,
  hameed-isca-2010} that accelerates specific parts of an application
while reducing the overheads present in general-purpose
processors. These designs seek to exploit the transistor bounty to
provision many different accelerators, keeping only the accelerators
needed for an application active and switching off the others on the
chip to save power. While promising, given the fast evolution of
languages and software, it is hard to define a set of fixed-function
hardware for commodity processors. Furthermore, the toolchain to
create such customized hardware is itself a hard research
challenge. We believe that software, applications, and runtime models
themselves can be refactored to significantly improve the overall
computing efficiency of commodity processors.

In this paper, we demonstrate with an XML parser that changes to the
underlying algorithm and compute model can significantly improve
efficiency on commodity processors. We achieve this efficiency by
carefully redesigning the algorithm to use the Parallel Bitstream
runtime framework (Parabix), which exploits the SIMD extensions
(SSE/AVX on x86, Neon on ARM) of commodity processors. The Parabix
framework uses modern instructions in the processor ISA that can
execute tens of operations (on multiple character streams) in a single
instruction, amortizing the overheads of the general-purpose
processor. Parabix also minimizes or entirely eliminates branches,
resulting in a more efficient pipeline, and improves overall
register/cache utilization, which minimizes the energy wasted on data
transfers. The SSE/AVX extensions that Parabix exploits also include
sophisticated instructions that let the algorithm pack and unpack
data elements in registers, which makes the overall cache access
behavior of the application regular, resulting in significantly fewer
misses and better utilization. Overall, as summarized by
Figure~\ref{perf-energy}, our Parabix-based XML parser improves
performance by ?$\times$ and energy efficiency by ?$\times$ compared
to widely used software parsers, approaching the ?~cycles per input
byte performance of ASIC XML
parsers~\cite{}.\footnote{The actual energy consumption of the XML
  ASIC chips is not published by the companies.}

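
As a rough illustration of the parallel bit stream idea, the following
C sketch transposes a block of up to 64 input bytes into eight bit
streams and then locates every occurrence of a given character using
only bitwise logic. It is a minimal scalar sketch under our own
assumptions, not the Parabix implementation: the names
\texttt{bitblock\_t}, \texttt{transpose}, and \texttt{match\_byte} are
hypothetical, and a 64-bit word stands in for a SIMD register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container: bit k of basis[j] holds bit j of input byte k
   (j = 0 is the least significant bit of the byte). */
typedef struct {
    uint64_t basis[8];
} bitblock_t;

/* Transpose up to 64 bytes into 8 bit streams. A SIMD version would use
   pack/shift instructions; this scalar loop only shows the data layout. */
static bitblock_t transpose(const unsigned char *src, size_t n)
{
    bitblock_t b = {{0}};
    assert(n <= 64);
    for (size_t k = 0; k < n; k++)
        for (int j = 0; j < 8; j++)
            b.basis[j] |= (uint64_t)((src[k] >> j) & 1u) << k;
    return b;
}

/* Return a bit stream with bit k set where input byte k equals c.
   The selection below depends only on the constant c, never on the
   input data, so all positions are tested with the same 8 AND steps. */
static uint64_t match_byte(const bitblock_t *b, unsigned char c, size_t n)
{
    uint64_t m = ~UINT64_C(0);
    for (int j = 0; j < 8; j++)
        m &= ((c >> j) & 1u) ? b->basis[j] : ~b->basis[j];
    if (n < 64)
        m &= (UINT64_C(1) << n) - 1;  /* clear positions past the input */
    return m;
}
```

For the seven-byte input \texttt{<a><b/>}, calling
\texttt{match\_byte} with \texttt{'<'} sets bits 0 and 3 of the result,
one bit per opening angle bracket, without any branch that depends on
the input bytes.
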
XML is a particularly interesting application; it is a standard of the
World Wide Web Consortium that provides a common framework for encoding
and communicating data.  XML provides critical data storage for
applications ranging from Office Open XML in Microsoft Office to NDFD
XML of the NOAA National Weather Service, and from KML in Google Earth
and Castor XML in the Martian Rovers to XML data in Android phones.  XML
parsing efficiency is important for multiple application areas: in
server workloads the key focus is on overall transactions per second,
while in network switches and cell phones the latency and energy cost
of parsing are of paramount importance. Software-based XML parsers are
particularly inefficient and consist of giant \textit{switch-case}
statements, which waste processor resources since they introduce
input-data-dependent branches. They also have poor cache efficiency
since they sift forward and backward through the input data stream
trying to match the parsed tags.  XML ASIC chips have been around for
over 6 years, but typically lag behind CPUs in technology due to cost
constraints. Our focus is on how much we can improve the performance of
the XML parser on commodity processors with Parabix technology.

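
To make the branch behavior of conventional parsers concrete, the
following C sketch shows a toy byte-at-a-time scanner built around a
\textit{switch-case} statement. It is a hypothetical illustration
(the name \texttt{count\_markup} and the two-state machine are our own
assumptions) and is far simpler than any production parser, but every
loop iteration still takes branches whose outcome depends on the input
byte.

```c
#include <stddef.h>

/* Hypothetical two-state scanner: outside markup or inside '<' ... '>'. */
enum state { TEXT, MARKUP };

/* Count complete '<' ... '>' spans, examining one byte per iteration. */
static size_t count_markup(const char *s, size_t n)
{
    enum state st = TEXT;
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        switch (st) {              /* branch on parser state */
        case TEXT:
            if (s[i] == '<')       /* branch on input data */
                st = MARKUP;
            break;
        case MARKUP:
            if (s[i] == '>') {     /* branch on input data */
                st = TEXT;
                count++;
            }
            break;
        }
    }
    return count;
}
```

A real parser has many more states and character tests per byte, so
the number of hard-to-predict branches grows accordingly.
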
Overall, we make the following contributions in this paper.

1) We develop an XML parser that demonstrates the impact of
redesigning the core of an application to make more efficient use of
commodity processors. We compare the Parabix-XML parser against
conventional parsers and demonstrate the improvement in overall
performance and energy efficiency. We also parallelize the
Parabix-XML parser to enable the different stages in the parser to
exploit SIMD units across all the cores. This further improves
performance while keeping energy consumption on par with the
sequential version.

2) We are the first to compare and contrast SSE/AVX extensions across
multiple generations of Intel processors and show that there are
performance challenges when using newer generation SIMD extensions,
possibly due to their memory interface. We compare ARM's Neon against
x86's SIMD extensions and comment on the latency of SIMD operations
across these architectures.

3) Finally, we introduce a runtime framework, \textit{Parabix}, that
abstracts the SIMD specifics of the machine (e.g., register widths)
and provides a language framework to enable applications to run
efficiently on commodity processors. Parabix enables
general-purpose multicores to be used efficiently by an entirely new
class of applications: text processing and parsing.

Figure~\ref{perf-energy} is an energy-performance scatter plot showing
the results obtained.

With all this XML processing, a substantial literature has arisen
addressing XML processing performance in general and the performance
of XML parsers in particular.  Nicola and John specifically identified
XML parsing as a threat to database performance and outlined a number
of directions for potential performance improvements
\cite{NicolaJohn03}.  The nature of XML APIs was found to have a
significant effect on performance, with event-based SAX (Simple API for
XML) parsers avoiding the tree construction costs of the more flexible
DOM (Document Object Model) parsers \cite{Perkins05}.  The commercial
importance of XML parsing spurred the development of hardware-based
approaches, including a custom XML chip
\cite{Leventhal2009} as well as FPGA-based implementations
\cite{DaiNiZhu2010}.  However promising these approaches may be for
particular niche applications, it is likely that the bulk of the
world's XML processing workload will be carried out on commodity
processors using software-based solutions.

To accelerate XML parsing performance in software, most recent
work has focused on parallelization.  The use of multicore
parallelism for chip multiprocessors has attracted
the attention of several groups \cite{ZhangPanChiu09, ParaDOM2009, LiWangLiuLi2009},
while SIMD (Single Instruction Multiple Data) parallelism
has been of interest to Intel in designing new SIMD
instructions~\cite{XMLSSE42}, as well as to the developers of parallel
bit stream technology.
Each of these approaches has shown considerable performance
benefits over traditional sequential parsing techniques that follow the
byte-at-a-time model.

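\begin{figure}
% Figure graphic not recoverable from the extracted source.
\caption{XML Parser Technology Energy vs. Performance}
\label{perf-energy}
\end{figure}
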
The remainder of this paper is organized as follows.
Section~\ref{background} presents background material on XML parsing
and provides insight into the inefficiency of traditional parsers on
mainstream processors.  Section~\ref{parallel-bitstream} reviews
parallel bit stream technology, a framework to exploit sophisticated
data parallel SIMD extensions on modern processors.  Section 5
presents a detailed performance evaluation on a \CITHREE\ processor as
our primary evaluation platform, addressing a number of
microarchitectural issues including cache misses, branch
mispredictions, and SIMD instruction counts.  Section 6 examines
scalability and performance gains through three generations of Intel
architecture, culminating with a performance assessment on our
two-week-old \SB\ test machine. We look specifically at issues in
applying the new 256-bit AVX technology to parallel bit stream
technology and note that the major performance benefit seen so far
results from the change to the non-destructive three-operand
instruction format.

%One area in which both servers and mobile devices devote considerable
%computational effort is in the processing of Extensible Markup
%Language (XML) documents.  It was predicted that corporate servers
%would see a ``growth in XML traffic\ldots from 15\% [of overall
%network traffic] in 2004 to just under 48\% by 2008''
%\cite{coyle2005}.  Further, ``from the point of view of server
%efficiency[,] XML\ldots is the closest thing there is to a ubiquitous
%computing workload'' \cite{leventhal2009}.  In other words, XML is
%quickly becoming the backbone of most server/server and client/server
%%information exchanges.  Similarly, there is growing interest in the
%use of mobile web services for personalization, context-awareness, and
%content-adaptation of mobile web sites---most of which rely on XML
%\cite{canali2009}.  Whether the end user realizes it or not, XML is
%part of their daily life.

%Why are XML parsers important ?
%Talk about XML parsers and what they do in general.
%Brief few lines about byte-at-time ?
%What's new with Parabix style approach ?
%Introduce Parabix1 and Parabix2 ?
%Present overall quantitative improvements compared to other parsers.