\section{Introduction}
Classical Dennard voltage scaling enabled us to keep all of the
transistors afforded by Moore's law active. Dennard scaling reached
its limits around 2005, forcing a rethink of the way general-purpose
processors are built: frequencies have remained stagnant over the
last five years, with a core's frequency boosted only when other
cores on the chip are shut off. Chip makers strive for
energy-efficient computing by operating cores at near-optimal
frequencies and aim to increase performance with a larger number of
cores. Unfortunately, given the limited levels of parallelism that can
be found in applications~\cite{blake-isca-2010}, it is not certain how
many cores can be productively used in scaling our
chips~\cite{esmaeilzadeh-isca-2011}, because exploiting
parallelism across multiple cores tends to require heavyweight
threads that are difficult to manage and synchronize.

The desire to improve the overall efficiency of computing is pushing
designers to explore customized hardware~\cite{venkatesh-asplos-2010,
  hameed-isca-2010} that accelerates specific parts of an application
while reducing the overheads present in general-purpose
processors. They seek to exploit the transistor bounty by provisioning
many different accelerators, keeping only the accelerators an
application needs active and switching off the others to save
power. While promising, given the fast evolution of languages and
software, it is hard to define a fixed set of hardware functions for
commodity processors. Furthermore, the toolchain needed to create such
customized hardware is itself a hard research challenge. We believe
that software, applications, and runtime models themselves can be
refactored to significantly improve the overall computing efficiency
of commodity processors.

In this paper, we tackle the infamous ``thirteenth dwarf''
(parsers/finite state machines), considered the hardest
application class to parallelize~\cite{Asanovic:EECS-2006-183}. We
present Parabix, a novel execution framework and software runtime
environment that can dramatically improve the efficiency of
text processing and parsing on commodity processors.  Parabix
transposes byte-oriented character data into parallel bit streams,
one for each bit of the character byte, and then exploits the
SIMD extensions of commodity processors (SSE/AVX on x86, Neon on ARM)
to process hundreds of character positions in an input stream
simultaneously.  To achieve transposition, Parabix exploits
sophisticated SIMD instructions that pack and unpack data elements
from registers in a regular pattern, which improves the
application's cache access behavior, yielding significantly fewer
misses and better utilization.  Parabix also dramatically reduces
branches in parsing code, resulting in a more efficient pipeline,
and substantially improves register/cache utilization, minimizing
energy wasted on data transfers.

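The transposed layout can be illustrated with a small scalar sketch. The function name and bit-by-bit loop here are ours for illustration only; Parabix itself performs this transposition with SIMD pack/unpack instructions rather than per-bit iteration:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative scalar transposition: 64 input bytes become 8 parallel
 * bit streams, one 64-bit word per bit position of the character byte.
 * Bit 0 of the byte (MSB) maps to streams[0], and input position p
 * maps to bit p of each stream word. */
static void transpose64(const uint8_t bytes[64], uint64_t streams[8])
{
    memset(streams, 0, 8 * sizeof(uint64_t));
    for (int pos = 0; pos < 64; pos++)
        for (int bit = 0; bit < 8; bit++)
            if (bytes[pos] & (1u << (7 - bit)))
                streams[bit] |= (uint64_t)1 << pos;
}
```

Once transposed, a character class such as `<` (0x3C) can be located at every one of the 64 positions with a single bitwise expression over the eight stream words, instead of one comparison per input byte.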
We apply Parabix technology to the problem of XML parsing and develop
several implementations for different computing platforms.  XML is a
particularly interesting application: it is a World Wide Web
Consortium standard that provides a common framework for encoding and
communicating data.  XML provides critical data storage for
applications ranging from Office Open XML in Microsoft Office to NDFD
XML of the NOAA National Weather Service, and from KML in Google Earth
to Castor XML in the Martian rovers.  XML parsing efficiency matters
in multiple application areas: in server workloads the key focus is
overall transactions per second, while in network switches and cell
phones, latency and energy are of paramount importance.  Traditional
software XML parsers have many inefficiencies, including considerable
branch misprediction penalties due to complex input-dependent
branching structures as well as poor use of memory bandwidth and data
caches due to byte-at-a-time processing and multiple buffering.  XML
ASIC chips have been available for over six years, but typically lag
behind CPUs in process technology due to cost constraints.  Our focus
is on how much Parabix technology can improve the performance of XML
parsing on commodity processors.

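The branching contrast can be sketched as follows (hypothetical helper functions of our own devising): a byte-at-a-time scan takes one input-dependent branch per character, while a precomputed marker bit stream is consumed without any per-byte branch.

```c
#include <assert.h>
#include <stdint.h>

/* Byte-at-a-time scan: one input-dependent branch per character,
 * which is what defeats the branch predictor on irregular input. */
static int count_markers_bytewise(const char *buf, int n)
{
    int count = 0;
    for (int i = 0; i < n; i++)
        if (buf[i] == '<')      /* taken or not, per byte */
            count++;
    return count;
}

/* Bit-stream style: once a 64-bit mask of '<' positions has been
 * computed (via transposition plus bitwise logic), counting or
 * locating markers needs no per-byte branch.  Kernighan's popcount
 * iterates once per set bit, not once per input byte. */
static int count_markers_bitstream(uint64_t lt_mask)
{
    int count = 0;
    while (lt_mask) {
        lt_mask &= lt_mask - 1;  /* clear lowest set bit */
        count++;
    }
    return count;
}
```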
In the end, as summarized by
Figure~\ref{perf-energy}, our Parabix-based XML parser improves
performance and energy efficiency several-fold compared
to widely-used software parsers, approaching the
performance of ASIC XML
parsers.\footnote{The actual energy consumption of the XML
  ASIC chips is not published by the companies.}

Overall, we make the following contributions in this paper.

1) We outline the Parabix architecture, tool chain, and runtime
environment and describe how they can be used to produce efficient
XML parser implementations on a variety of commodity processors.
While studied in the context of XML parsing, the Parabix framework
can be widely applied to many problems in text processing and
parsing.

2) We compare Parabix XML parsers against conventional parsers and
assess the improvement in overall performance and energy efficiency on
each platform.  We are the first to compare and contrast the SSE/AVX
extensions across multiple generations of Intel processors and show
that there are performance challenges when using newer-generation SIMD
extensions. We also compare the ARM Neon extensions against the x86
SIMD extensions and comment on the latency of SIMD operations across
these architectures.

3) Finally, building on the SIMD parallelism of Parabix technology,
we multithread the Parabix XML parser so that the different
stages of the parser exploit the SIMD units across all the cores.
This further improves performance while keeping energy consumption
on par with the sequential version.


\begin{comment}
Figure~\ref{perf-energy} is an energy-performance scatter plot showing
the results obtained.


With all this XML processing, a substantial literature has arisen
addressing XML processing performance in general and the performance
of XML parsers in particular.  Nicola and John specifically identified
XML parsing as a threat to database performance and outlined a number
of potential directions for performance improvements
\cite{NicolaJohn03}.  The nature of XML APIs was found to have a
significant effect on performance, with event-based SAX (Simple API for
XML) parsers avoiding the tree construction costs of the more flexible
DOM (Document Object Model) parsers \cite{Perkins05}.  The commercial
importance of XML parsing spurred developments of hardware-based
approaches, including the development of a custom XML chip
\cite{Leventhal2009} as well as FPGA-based implementations
\cite{DaiNiZhu2010}.  However promising these approaches may be for
particular niche applications, it is likely that the bulk of the
world's XML processing workload will be carried out on commodity
processors using software-based solutions.

To accelerate XML parsing performance in software, most recent
work has focused on parallelization.  The use of multicore
parallelism for chip multiprocessors has attracted
the attention of several groups \cite{ZhangPanChiu09, ParaDOM2009, LiWangLiuLi2009},
while SIMD (Single Instruction Multiple Data) parallelism
has been of interest to Intel in designing new SIMD instructions \cite{XMLSSE42},
as well as to the developers of parallel bit stream technology
\cite{CameronHerdyLin2008,Cameron2009,Cameron2010}.
Each of these approaches has shown considerable performance
benefits over traditional sequential parsing techniques that follow the
byte-at-a-time model.
\end{comment}


\begin{figure}
\begin{center}
\includegraphics[width=85mm]{plots/performance_energy_chart.pdf}
\end{center}
\caption{XML Parser Technology Energy vs. Performance}
\label{perf-energy}
\end{figure}

The remainder of this paper is organized as follows.
Section~\ref{section:background} presents background material on XML
parsing and provides insight into the inefficiency of traditional
parsers on mainstream processors.  Section~\ref{section:parabix}
describes the Parabix architecture, tool chain, and runtime
environment.  Section~\ref{section:parser} describes the application
of the Parabix framework to the construction of an XML parser
enforcing all the well-formedness rules of the XML specification.
Section~\ref{section:baseline} presents a detailed performance
analysis of Parabix on a \CITHREE\ system using hardware performance
counters and compares it against conventional parsers.
Section~\ref{section:scalability} compares the performance and energy
efficiency of 128-bit SIMD extensions across three generations of
Intel processors and includes a comparison with the ARM Cortex-A8
processor.  Section~\ref{section:avx} examines Intel's new 256-bit
AVX technology and comments on its benefits and challenges compared to
the 128-bit SSE instructions.  Finally,
Section~\ref{section:multithread} looks at multithreading the
Parabix XML parser to exploit the SIMD units scattered
across multiple cores.

%One area in which both servers and mobile devices devote considerable
%computational effort into is in the processing of Extensible Markup
%Language (XML) documents.  It was predicted that corporate servers
%would see a ``growth in XML traffic\ldots from 15\% [of overall
%network traffic] in 2004 to just under 48\% by 2008''
%\cite{coyle2005}.  Further, ``from the point of view of server
%efficiency[,] XML\ldots is the closest thing there is to a ubiquitous
%computing workload'' \cite{leventhal2009}.  In other words, XML is
%quickly becoming the backbone of most server/server and client/server
%information exchanges.  Similarly, there is growing interest in the
%use of mobile web services for personalization, context-awareness, and
%content-adaptation of mobile web sites---most of which rely on XML
%\cite{canali2009}.  Whether the end user realizes it or not, XML is
%part of their daily life.

%Why are XML parsers important ?
%Talk about XML parsers and what they do in general.
%Brief few lines about byte-at-time ?
%What's new with Parabix style approach ?
%Introduce Parabix1 and Parabix2 ?
%Present overall quantitative improvements compared to other parsers.