\section{Introduction}
Classical Dennard voltage scaling enabled us to keep all of the
transistors afforded by Moore's law active. Dennard scaling reached
its limits in 2005, forcing a rethink of the way general-purpose
processors are built: frequencies have remained stagnant over the
last five years, with the ability to boost a core's frequency only if
other cores on the chip are shut off. Chip makers strive to achieve
energy-efficient computing by operating cores at more optimal
frequencies and aim to increase performance with a larger number of
cores. Unfortunately, given the limited levels of parallelism that can
be found in applications~\cite{blake-isca-2010}, it is not certain how
many cores can be productively used in scaling our
chips~\cite{esmaeilzadeh-isca-2011}, because exploiting parallelism
across multiple cores tends to require heavyweight threads that are
difficult to manage and synchronize.

The desire to improve the overall efficiency of computing is pushing
designers to explore customized hardware~\cite{venkatesh-asplos-2010,
  hameed-isca-2010} that accelerates specific parts of an application
while reducing the overheads present in general-purpose processors.
They seek to exploit the transistor bounty to provision many different
accelerators, keeping only the accelerators needed for an application
active while switching off the others to save power. While promising,
given the fast evolution of languages and software, it is hard to
define a set of fixed-function hardware for commodity processors.
Furthermore, the toolchain to create such customized hardware is
itself a hard research challenge. We believe that software,
applications, and runtime models themselves can be refactored to
significantly improve the overall computing efficiency of commodity
processors.

In this paper we tackle the infamous ``thirteenth dwarf''
(parsers/finite state machines), considered the hardest application
class to parallelize~\cite{Asanovic:EECS-2006-183}. We present
Parabix, a novel execution framework and software runtime environment
that can be used to dramatically improve the efficiency of text
processing and parsing on commodity processors. Parabix transposes
byte-oriented character data into parallel bit streams and then
exploits the SIMD extensions on commodity processors (SSE/AVX on x86,
Neon on ARM) to process hundreds of character positions in an input
stream simultaneously. To transform character-oriented data into bit
streams, Parabix exploits sophisticated SIMD instructions that enable
data elements to be packed into registers. This improves the overall
cache behaviour of the application, resulting in significantly fewer
misses and better utilization. Parabix also dramatically reduces
branches in the parsing routines, resulting in a more efficient
pipeline, and substantially improves register utilization, which
minimizes energy wasted on data transfers.

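As a concrete illustration (a scalar sketch, not taken from the Parabix sources, which use SIMD pack operations), the transposition can be expressed in plain C: stream $k$ collects bit $k$ of every input byte, so that a character class such as \texttt{<} becomes a single bitwise expression evaluated over 64 positions at once.

```c
#include <stdint.h>

/* Scalar sketch of Parabix-style transposition (illustrative only;
   the actual runtime uses SIMD pack instructions).  After the call,
   bit i of streams[k] holds bit k of bytes[i]. */
void transpose_block(const uint8_t *bytes, int n, uint64_t streams[8])
{
    for (int k = 0; k < 8; k++)
        streams[k] = 0;
    for (int i = 0; i < n && i < 64; i++)
        for (int k = 0; k < 8; k++)
            if (bytes[i] & (1u << k))
                streams[k] |= 1ULL << i;
}

/* With the streams in hand, a character-class stream marking every
   '<' (0x3C: bits 2-5 set, bits 0, 1, 6, 7 clear) is one bitwise
   expression covering 64 character positions. */
uint64_t mark_lt(const uint64_t s[8])
{
    return s[2] & s[3] & s[4] & s[5]
         & ~s[0] & ~s[1] & ~s[6] & ~s[7];
}
```

Here `transpose_block` and `mark_lt` are hypothetical names chosen for the sketch; the point is the data layout, in which one register operation inspects a bit from dozens of characters simultaneously.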
We apply Parabix technology to the problem of XML parsing. XML is a
standard of the web consortium that provides a common framework for
encoding and communicating data. XML provides critical data storage
for applications ranging from Office Open XML in Microsoft Office to
NDFD XML of the NOAA National Weather Service, from KML in Google
Earth to Castor XML in the Martian Rovers. XML parsing efficiency is
important for multiple application areas; in server workloads the key
focus is on overall transactions per second, while in applications for
network switches and cell phones, latency and energy are of paramount
importance. Conventional software-based XML parsers have many
inefficiencies, including considerable branch misprediction penalties
due to complex input-dependent branching structures as well as poor
use of caches and memory bandwidth due to byte-at-a-time processing.
XML ASIC chips have been available since early 2003, but typically lag
behind CPUs in technology due to cost constraints~\cite{xmlchip}. They
also focus mainly on speeding up the parser computation itself and are
limited by poor memory behaviour. Our focus is on how much we can
improve the performance of the XML parser on commodity processors with
Parabix technology.

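To make the branching inefficiency concrete, the hypothetical sketch below (not drawn from any particular parser) contrasts a conventional byte-at-a-time scan, which takes an input-dependent branch on every character, with a scan over a precomputed \texttt{<} marker stream that replaces those branches with one branch-free population count per 64 positions.

```c
#include <stdint.h>

/* Conventional byte-at-a-time scan: one input-dependent branch per
   character, which mispredicts heavily on irregular markup. */
long count_lt_bytes(const char *buf, long n)
{
    long count = 0;
    for (long i = 0; i < n; i++)
        if (buf[i] == '<')          /* hard-to-predict branch */
            count++;
    return count;
}

/* Bit-stream scan: given a marker stream with one bit set per '<'
   position, count occurrences with a branch-free popcount per
   64-character block. */
long count_lt_stream(const uint64_t *markers, long nwords)
{
    long count = 0;
    for (long i = 0; i < nwords; i++)
        count += __builtin_popcountll(markers[i]);  /* GCC/Clang builtin */
    return count;
}
```

The only branch left in the second loop is the loop bound itself, which is perfectly predictable; this is the style of control-flow reduction the bit-stream representation enables.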
\begin{figure}
\begin{center}
\includegraphics[width=85mm]{plots/performance_energy_chart.pdf}
\end{center}
\caption{XML Parser Technology Energy vs. Performance}
\label{perf-energy}
\end{figure}

Figure~\ref{perf-energy} showcases the overall efficiency of our
framework. The Parabix-XML parser improves the
performance %by ?$\times$
and energy efficiency %by ?$\times$
several-fold compared to widely-used software parsers, approaching the
%?$cycles/input-byte$
performance of ASIC XML
parsers~\cite{xmlchip,DaiNiZhu2010}.\footnote{The actual energy
  consumption of the XML ASIC chips is not published by the
  companies.}
%
Overall we make the following contributions:

1) We outline the Parabix architecture, tool chain and runtime
environment and describe how it may be used to produce efficient XML
parser implementations on a variety of commodity processors. While
studied in the context of XML parsing, the Parabix framework can be
widely applied to many problems in text processing and parsing. We
have released Parabix as open source and are interested in exploring
the applications that can take advantage of our tool chain
(\textit{http://parabix.costar.sfu.ca/}).

2) We compare the Parabix XML parser against conventional parsers and
assess the improvement in overall performance and energy efficiency on
a variety of hardware platforms. We are the first to compare and
contrast the SSE/AVX extensions across multiple generations of Intel
processors and show that there are performance challenges when using
newer-generation SIMD extensions. We also compare the ARM Neon
extensions against the x86 SIMD extensions and comment on the latency
of SIMD operations.

3) Finally, building on the SIMD parallelism of Parabix technology, we
multithread the Parabix XML parser to enable the different stages in
the parser to exploit SIMD units across all the cores. This further
improves performance while keeping energy consumption comparable to
that of the sequential version.

The remainder of this paper is organized as follows.
Section~\ref{section:background} presents background material on XML
parsing and provides insight into the inefficiency of traditional
parsers. Section~\ref{section:parabix} describes the Parabix
architecture, tool chain and runtime environment.
Section~\ref{section:parser} describes the design of an XML parser
based on the Parabix framework. Section~\ref{section:baseline}
presents a detailed performance analysis of Parabix on a
\CITHREE\ system using hardware performance counters.
Section~\ref{section:scalability} compares the performance and energy
efficiency of 128-bit SIMD extensions across three generations of
Intel processors and includes a comparison with the ARM Cortex-A8
processor. Section~\ref{section:avx} examines Intel's new 256-bit AVX
technology and comments on its benefits and challenges compared to the
128-bit SSE instructions. Finally, Section~\ref{section:multithread}
looks at the multithreading of the Parabix XML parser, which seeks to
exploit the SIMD units scattered across multiple cores.

%One area in which both servers and mobile devices devote considerable
%computational effort is the processing of Extensible Markup
%Language (XML) documents.  It was predicted that corporate servers
%would see a ``growth in XML traffic\ldots from 15\% [of overall
%network traffic] in 2004 to just under 48\% by 2008''
%\cite{coyle2005}.  Further, ``from the point of view of server
%efficiency[,] XML\ldots is the closest thing there is to a ubiquitous
%computing workload'' \cite{leventhal2009}.  In other words, XML is
%quickly becoming the backbone of most server/server and client/server
%information exchanges.  Similarly, there is growing interest in the
%use of mobile web services for personalization, context-awareness, and
%content-adaptation of mobile web sites---most of which rely on XML
%\cite{canali2009}.  Whether the end user realizes it or not, XML is
%part of their daily life.

%Why are XML parsers important?
%Talk about XML parsers and what they do in general.
%Brief few lines about byte-at-a-time?
%What's new with the Parabix-style approach?
%Introduce Parabix1 and Parabix2?
%Present overall quantitative improvements compared to other parsers.