source: docs/HPCA2012/01-intro.tex @ 1348

Last change on this file since 1348 was 1348, checked in by ashriram, 8 years ago

new abstract for new intro

File size: 10.5 KB
2We have now long since reached the limit to classical Dennard voltage
3scaling that enabled us to keep all of transistors afforded by
4Moore's law active. This has already resulted in a rethink of the way
5general-purpose processors are built: processor frequencies have
6remained stagnant over the last 5 years with the capability to boost
7core speeds on Intel multicores only if other cores on the chip are
8shut off. Chip makers strive to achieve energy efficient computing by
9operating at more optimal core frequencies and aim to increase
10performance with larger number of cores. Unfortunately, given the
11limited levels of parallelism that can be found in
12applications~\cite{blake-isca-2010}, it is not certain how many cores
13can be productively used in scaling our
14chips~\cite{esmaeilzadeh-isca-2011}. This is because exploiting
15parallelism across multiple cores tends to require heavweight threads
16that are difficult to manage and synchronize.
18The desire to improve the overall efficiency of computing is pushing
19designers to explore customized hardware~\cite{venkatesh-asplos-2010,
20  hameed-isca-2010} that accelerate specific parts of an application
21while reducing the overheads present in general-purpose
22processors. They seek to exploit the transistor bounty to provision
23many different accelerators and keep only the accelerators needed for
24an application active while switching off others on the chip to save
25power consumption. While promising, given the fast evolution of
26languages and software, its hard to define a set of fixed-function
27hardware for commodity processors. Furthermore, the toolchain to
28create such customized hardware is itself a hard research
29challenge. We believe that software, applications, and runtime models
30themselves can be refactored to significantly improve the overall
31computing efficiency of commodity processors.
33In this paper, we tackle the infamous ``thirteenth dwarf''
34(parsers/finite state machines) that is considered to be the hardest
35application class to parallelize~\cite{Asanovic:EECS-2006-183} and
36show how Parabix, a novel software architecture, tool chain and
37run-time environment can indeed be used to dramatically improve
38parsing efficiency on commodity processors.  Based on the concept of
39transposing byte-oriented character data into parallel bit streams for
40the individual bits of each byte, the Parabix framework exploits the
41SIMD extensions (SSE/AVX on x86, Neon on ARM) on commodity processors
42to process hundreds of character positions in an input stream
43simultaneously.  To achieve transposition, Parabix exploits
44sophisticated SIMD instructions that enable data elements to be packed
45and unpacked from registers in a regular manner which improve the
46overall cache access behavior of the application resulting in
47significantly fewer misses and better utilization.  Parabix also
48dramatically reduces branches in parsing code resulting in a more
49efficient pipeline and substantially improves register/cache
50utilization which minimizes energy wasted on data transfers.
52We apply Parabix technology to the problem of XML parsing and develop
53several implementations for different computing platforms.  XML is a
54particularly interesting application; it is a standard of the web
55consortium that provides a common framework for encoding and
56communicating data.  XML provides critical data storage for
57applications ranging from Office Open XML in Microsoft Office to NDFD
58XML of the NOAA National Weather Service, from KML in Google Earth to
59Castor XML in the Martian Rovers, a XML data in Android phones.  XML
60parsing efficiency is important for multiple application areas; in
61server workloads the key focus in on overall transactions per second,
62while in applications in the network switches and cell phones, latency
63and the energy are of paramount importance.  Traditional
64software-based XML parsers have many inefficiencies due to complex
65input-dependent branching structures leading to considerable branch
66misprediction penalties as well as poor use of memory bandwidth and
67data caches due to byte-at-a-time processing and multiple buffering.
68XML ASIC chips have been around for over 6 years, but typically lag
69behind CPUs in technology due to cost constraints. Our focus is how
70much can we improve performance of the XML parser on commodity
71processors with Parabix technology.
73In the end, as summarized by
74Figure~\ref{perf-energy} our Parabix-based XML parser improves the
75performance by ?$\times$ and energy efficiency by ?$\times$ compared
76to widely-used software parsers and approaching the performance of
77?$cycles/input-byte$ performance of ASIC XML
78parsers~\cite{}.\footnote{The actual energy consumption of the XML
79  ASIC chips is not published by the companies.}
81Overall we make the following contributions in this paper.
831) We introduce the Parabix architecture, tool chain and run-time
84environment and describe how it may be used to produce efficient
85XML parser implementations on a variety of commodity processors.
86While studied in the context of XML parsing, the Parabix framework
87can be widely applied to many problems in text processing and
902) We compare our Parabix XML parsers against conventional parsers
91and assess the improvement in overall performance and energy efficiency
92on each platform.   We are the first to compare and contrast SSE/AVX extensions across
93multiple generation of Intel processors and show that there are
94performance challenges when using newer generation SIMD extensions,
95possibly due to their memory interface. We compare the ARM Neon
96extensions against the x86 SIMD extensions and comment on the latency of
97SIMD operations across these architectures.
993) Finally, building on the SIMD parallelism of Parabix technology,
100we multithread the Parabix XML parser to to enable the different
101stages in the parser to exploit SIMD units across all the cores.
102This further improves performance while maintaining the energy consumption
103constant with the sequential version.
107Figure~\ref{perf-energy} is an energy-performance scatter plot showing
108the results obtained.
111With all this XML processing, a substantial literature has arisen
112addressing XML processing performance in general and the performance
113of XML parsers in particular.  Nicola and John specifically identified
114XML parsing as a threat to database performance and outlined a number
115of potential directions for potential performance improvements
116\cite{NicolaJohn03}.  The nature of XML APIs was found to have a
117significant affect on performance with event-based SAX (Simple API for
118XML) parsers avoiding the tree construction costs of the more flexible
119DOM (Document Object Model) parsers \cite{Perkins05}.  The commercial
120importance of XML parsing spurred developments of hardware-based
121approaches including the development of a custom XML chip
122\cite{Leventhal2009} as well as FPGA-based implementations
123\cite{DaiNiZhu2010}.  However promising these approaches may be for
124particular niche applications, it is likely that the bulk of the
125world's XML processing workload will be carried out on commodity
126processors using software-based solutions.
128To accelerate XML parsing performance in software, most recent
129work has focused on parallelization.  The use of multicore
130parallelism for chip multiprocessors has attracted
131the attention of several groups \cite{ZhangPanChiu09, ParaDOM2009, LiWangLiuLi2009},
132while SIMD (Single Instruction Multiple Data) parallelism
133has been of interest to Intel in designing new SIMD instructions\cite{XMLSSE42}
134, as well as to the developers of parallel bit stream technology
136Each of these approaches has shown considerable performance
137benefits over traditional sequential parsing techniques that follow the
138byte-at-a-time model.
147\caption{XML Parser Technology Energy vs. Performance}
151The remainder of this paper is organized as follows.
152Section~\ref{section:background} presents background material on XML parsing
153and provides insight into the inefficiency of traditional parsers on
154mainstream processors.  Section~\ref{section:parabix} describes the
155Parabix architecture, tool chain and run-time environment.
156Section~\ref{section:parser} describes the application of the
157Parabix framework to the construction of an XML parser
158meeting enforcing all the well-formedness rules of the XML
159specification.  Section~\ref{section:methodology} then describes
160the overall methodology of our performance and energy study.
161Section~\ref{section:baseline} presents a detailed
162performance evaluation on a \CITHREE\ processor as
163our primary evaluation platform, addressing a number of
164microarchitectural issues including cache misses, branch
165mispredictions, and SIMD instruction counts.  Section~\ref{section:scalability} examines
166scalability and performance gains through three generations of Intel
167architecture.  Section~\ref{section:avx} examines the extension
168of the Parabix technology to take advantage of Intel's new
169256-bit AVX technology, while Section~\ref{section:neon} investigates
170the applications of this technology on mobile platforms using
171ARM processors with Neon SIMD extensions.
172Section~\ref{section:multithread} then looks at the multithreading of the
173Parabix XML parser using pipeline parallelism.
174Section~\ref{section:conclusion} concludes the paper.
179%One area in which both servers and mobile devices devote considerable
180%computational effort into is in the processing of Extensible Markup
181%Language (XML) documents.  It was predicted that corporate servers
182%would see a ``growth in XML traffic\ldots from 15\% [of overall
183%network traffic] in 2004 to just under 48\% by 2008''
184%\cite{coyle2005}.  Further, ``from the point of view of server
185%efficiency[,] XML\ldots is the closest thing there is to a ubiquitous
186%computing workload'' \cite{leventhal2009}.  In other words, XML is the
187%quickly becoming the backbone of most server/server and client/server
188%%information exchanges.  Similarly, there is growing interest in the
189%use of mobile web services for personalization, context-awareness, and
190%content-adaptation of mobile web sites---most of which rely on XML
191%\cite{canali2009}.  Whether the end user realizes it or not, XML is
192%part of their daily life.
194%Why are XML parsers important ?
195%Talk about XML parsers and what they do in general.
196%Brief few lines about byte-at-time ?
197%What's new with Parabix style approach ?
198%Introduce Parabix1 and Parabix2 ?
199%Present overall quantiative improvements compared to other parsers.
Note: See TracBrowser for help on using the repository browser.