source: docs/HPCA2012/01-intro.tex @ 1379

Last change on this file since 1379 was 1373, checked in by cameron, 8 years ago

Minor fixes

File size: 10.5 KB
Line 
1\section{Introduction}
2We have now long since reached the limit to classical Dennard voltage
3scaling that enabled us to keep all of transistors afforded by
4Moore's law active. This has already resulted in a rethink of the way
5general-purpose processors are built: processor frequencies have
6remained stagnant over the last 5 years with the capability to boost
7core speeds on Intel multicores only if other cores on the chip are
8shut off. Chip makers strive to achieve energy efficient computing by
9operating at more optimal core frequencies and aim to increase
10performance with a larger number of cores. Unfortunately, given the
11limited levels of parallelism that can be found in
12applications~\cite{blake-isca-2010}, it is not certain how many cores
13can be productively used in scaling our
14chips~\cite{esmaeilzadeh-isca-2011}. This is because exploiting
15parallelism across multiple cores tends to require heavy weight threads
16that are difficult to manage and synchronize.
17
18The desire to improve the overall efficiency of computing is pushing
19designers to explore customized hardware~\cite{venkatesh-asplos-2010,
20  hameed-isca-2010} that accelerate specific parts of an application
21while reducing the overheads present in general-purpose
22processors. They seek to exploit the transistor bounty to provision
23many different accelerators and keep only the accelerators needed for
24an application active while switching off others on the chip to save
25power consumption. While promising, given the fast evolution of
26languages and software, its hard to define a set of fixed-function
27hardware for commodity processors. Furthermore, the toolchain to
28create such customized hardware is itself a hard research
29challenge. We believe that software, applications, and runtime models
30themselves can be refactored to significantly improve the overall
31computing efficiency of commodity processors.
32
33In this paper, we tackle the infamous ``thirteenth dwarf''
34(parsers/finite state machines) that is considered to be the hardest
35application class to parallelize~\cite{Asanovic:EECS-2006-183} and
36show how Parabix, a novel software architecture, tool chain and
37run-time environment can indeed be used to dramatically improve
38parsing efficiency on commodity processors.  Based on the concept of
39transposing byte-oriented character data into parallel bit streams for
40the individual bits of each byte, the Parabix framework exploits the
41SIMD extensions on commodity processors (SSE/AVX on x86, Neon on ARM)
42to process hundreds of character positions in an input stream
43simultaneously.  To achieve transposition, Parabix exploits
44sophisticated SIMD instructions that enable data elements to be packed
45and unpacked from registers in a regular manner which improves the
46overall cache access behavior of the application resulting in
47significantly fewer misses and better utilization.  Parabix also
48dramatically reduces branches in parsing code resulting in a more
49efficient pipeline and substantially improves register/cache
50utilization which minimizes energy wasted on data transfers.
51
52We apply Parabix technology to the problem of XML parsing and develop
53several implementations for different computing platforms.  XML is a
54particularly interesting application; it is a standard of the web
55consortium that provides a common framework for encoding and
56communicating data.  XML provides critical data storage for
57applications ranging from Office Open XML in Microsoft Office to NDFD
58XML of the NOAA National Weather Service, from KML in Google Earth to
59Castor XML in the Martian Rovers, as well as ubiquitous XML data in Android phones.  XML
60parsing efficiency is important for multiple application areas; in
61server workloads the key focus in on overall transactions per second,
62while in applications in network switches and cell phones, latency
63and energy are of paramount importance.  Traditional
64software-based XML parsers have many inefficiencies including
65considerable branch misprediction penalties due to complex
66input-dependent branching structures as well as poor use of memory bandwidth and
67data caches due to byte-at-a-time processing and multiple buffering.
68XML ASIC chips have been around for over 6 years, but typically lag
69behind CPUs in technology due to cost constraints. Our focus is how
70much we can improve performance of the XML parser on commodity
71processors with Parabix technology.
72
73In the end, as summarized by
74Figure~\ref{perf-energy} our Parabix-based XML parser improves the
75performance %by ?$\times$
76and energy efficiency %by ?$\times$
77several-fold compared
78to widely-used software parsers, approaching the
79%?$cycles/input-byte$
80performance of ASIC XML
81parsers.%~\cite{}.
82\footnote{The actual energy consumption of the XML
83  ASIC chips is not published by the companies.}
84
85Overall we make the following contributions in this paper.
86
871) We outline the Parabix architecture, tool chain and run-time
88environment and describe how it may be used to produce efficient
89XML parser implementations on a variety of commodity processors.
90While studied in the context of XML parsing, the Parabix framework
91can be widely applied to many problems in text processing and
92parsing.
93
942) We compare  Parabix XML parsers against conventional parsers
95and assess the improvement in overall performance and energy efficiency
96on each platform.   We are the first to compare and contrast SSE/AVX extensions across
97multiple generation of Intel processors and show that there are
98performance challenges when using newer generation SIMD extensions,
99possibly due to their memory interface. We compare the ARM Neon
100extensions against the x86 SIMD extensions and comment on the latency of
101SIMD operations across these architectures.
102
1033) Finally, building on the SIMD parallelism of Parabix technology,
104we multithread the Parabix XML parser to to enable the different
105stages in the parser to exploit SIMD units across all the cores.
106This further improves performance while maintaining the energy consumption
107constant with the sequential version.
108
109
110\begin{comment}
111Figure~\ref{perf-energy} is an energy-performance scatter plot showing
112the results obtained.
113
114
115With all this XML processing, a substantial literature has arisen
116addressing XML processing performance in general and the performance
117of XML parsers in particular.  Nicola and John specifically identified
118XML parsing as a threat to database performance and outlined a number
119of potential directions for potential performance improvements
120\cite{NicolaJohn03}.  The nature of XML APIs was found to have a
121significant affect on performance with event-based SAX (Simple API for
122XML) parsers avoiding the tree construction costs of the more flexible
123DOM (Document Object Model) parsers \cite{Perkins05}.  The commercial
124importance of XML parsing spurred developments of hardware-based
125approaches including the development of a custom XML chip
126\cite{Leventhal2009} as well as FPGA-based implementations
127\cite{DaiNiZhu2010}.  However promising these approaches may be for
128particular niche applications, it is likely that the bulk of the
129world's XML processing workload will be carried out on commodity
130processors using software-based solutions.
131
132To accelerate XML parsing performance in software, most recent
133work has focused on parallelization.  The use of multicore
134parallelism for chip multiprocessors has attracted
135the attention of several groups \cite{ZhangPanChiu09, ParaDOM2009, LiWangLiuLi2009},
136while SIMD (Single Instruction Multiple Data) parallelism
137has been of interest to Intel in designing new SIMD instructions\cite{XMLSSE42}
138, as well as to the developers of parallel bit stream technology
139\cite{CameronHerdyLin2008,Cameron2009,Cameron2010}.
140Each of these approaches has shown considerable performance
141benefits over traditional sequential parsing techniques that follow the
142byte-at-a-time model.
143\end{comment}
144
145
146
147\begin{figure}
148\begin{center}
149\includegraphics[width=85mm]{plots/performance_energy_chart.pdf}
150\end{center}
151\caption{XML Parser Technology Energy vs. Performance}
152\label{perf-energy}
153\end{figure}
154
155The remainder of this paper is organized as follows.
156Section~\ref{section:background} presents background material on XML parsing
157and provides insight into the inefficiency of traditional parsers on
158mainstream processors.  Section~\ref{section:parabix} describes the
159Parabix architecture, tool chain and run-time environment.
160Section~\ref{section:parser} describes the application of the
161Parabix framework to the construction of an XML parser
162enforcing all the well-formedness rules of the XML
163specification.  Section~\ref{section:methodology} then describes
164the overall methodology of our performance and energy study.
165Section~\ref{section:baseline} presents a detailed
166performance evaluation on a \CITHREE\ processor as
167our primary evaluation platform, addressing a number of
168microarchitectural issues including cache misses, branch
169mispredictions, and SIMD instruction counts.  Section~\ref{section:scalability} examines
170scalability and performance gains through three generations of Intel
171architecture.  Section~\ref{section:avx} examines the extension
172of the Parabix technology to take advantage of Intel's new
173256-bit AVX technology, while Section~\ref{section:neon} investigates
174the applications of this technology on mobile platforms using
175ARM processors with Neon SIMD extensions.
176Section~\ref{section:multithread} then looks at the multithreading of the
177Parabix XML parser using pipeline parallelism.
178Section~\ref{section:related} discusses related work, after which
179Section~\ref{section:conclusion} concludes the paper.
180
181
182
183
184%One area in which both servers and mobile devices devote considerable
185%computational effort into is in the processing of Extensible Markup
186%Language (XML) documents.  It was predicted that corporate servers
187%would see a ``growth in XML traffic\ldots from 15\% [of overall
188%network traffic] in 2004 to just under 48\% by 2008''
189%\cite{coyle2005}.  Further, ``from the point of view of server
190%efficiency[,] XML\ldots is the closest thing there is to a ubiquitous
191%computing workload'' \cite{leventhal2009}.  In other words, XML is the
192%quickly becoming the backbone of most server/server and client/server
193%%information exchanges.  Similarly, there is growing interest in the
194%use of mobile web services for personalization, context-awareness, and
195%content-adaptation of mobile web sites---most of which rely on XML
196%\cite{canali2009}.  Whether the end user realizes it or not, XML is
197%part of their daily life.
198
199%Why are XML parsers important ?
200%Talk about XML parsers and what they do in general.
201%Brief few lines about byte-at-time ?
202%What's new with Parabix style approach ?
203%Introduce Parabix1 and Parabix2 ?
204%Present overall quantiative improvements compared to other parsers.
205
206
207
208
Note: See TracBrowser for help on using the repository browser.