# source:docs/HPCA2012/final_ieee/01-intro.tex@1779

Last change on this file since 1779 was 1779, checked in by cameron, 8 years ago

Related work

File size: 7.0 KB
RevLine
[1733]1\section{Introduction}
[1743]2Modern applications ranging from web search to analytics are mainly
[1774]3data centric operating over large swaths of information. Information expansion
[1743]4and diversification of data has resulted in multiple textual storage
[1752]5formats.  Of these, XML is one of the most widely used standards, providing
6a common framework for encoding and communicating data.
7It is used in applications ranging
[1743]8from Office Open XML in Microsoft Office to NDFD XML of the NOAA
9National Weather Service, from KML in Google Earth to Castor XML in
10the Martian Rovers. To enable these diverse applications we need high
[1752]11performance, scalable, and energy efficient processing techniques
12for textual data in general, and XML, in particular.
[1733]13
[1743]14%; in server
15%workloads the key focus in on overall transactions per second, while
16%in applications for network switches and cell phones, latency and
17%energy are of paramount importance.
[1733]18
19
[1743]20Unfortunately, given the limited levels of parallelism that can be
[1752]21found in traditional text processing, it is not clear how this important
22class of application
[1743]23can benefit from the growth in multicore processors. As a widely cited
[1752]24Berkeley study~\cite{Asanovic:EECS-2006-183} reports, text processing
25applications represented by the thirteenth
26dwarf'' (parsers/finite state machines) are
[1743]27considered to be the hardest application class to parallelize and
[1752]28process efficiently.  Conventional software-based text parsers have
[1743]29many inefficiencies including considerable branch misprediction
30penalties due to complex input-dependent branching structures as well
31as poor use of caches and memory bandwidth due to byte-at-a-time
32processing. ASIC chips that process XML textual data have been around
33since early 2003, but typically lag behind CPUs in technology due to
[1752]34cost constraints~\cite{xmlchip}
35%They also focus mainly on speeding up
36%the computational work alone, without addressing problems of poor memory
37%behaviour.
[1733]38
[1743]39%
40% Introduce Parabix.
41%
42
[1752]43Parallel bit stream (Parabix) technology is a promising new approach
[1768]44for high performance text processing. The key insight is based on the transposition of
45byte-oriented character data into parallel bit streams (each with one
46bit per input byte) which permits text processing to exploit SIMD
[1779]47operations on modern processors \cite{Cameron2008, CameronHerdyLin2008}.
48Parabix also incorporates a systematic and portable SIMD programming
49framework based on our model inductive doubling instruction set
50architecture~\cite{CameronLin2009}.    Most recently, a new
51parallel scanning primitive \cite{cameron-EuroPar2011} has
52been incorporated into the technology base to form the
53basis of the Pablo compiler documented in this paper.
[1743]54
55
56%The first generation of Parabix XML parser~\cite{CameronHerdyLin2008},
57%which applies a sequential bit scan method, has already shown a
58%substantial improvement on performance. The latest version or the
59%second generation of Parabix XML parser~\cite{Cameron2010} introduced
60%a new idea, parallel bit scan, which provides us a more efficient
61%scanning and better utilization of the resources.
62
63
[1733]64\begin{figure}
65\begin{center}
66\includegraphics[width=85mm]{plots/performance_energy_chart.pdf}
67\end{center}
68\caption{XML Parser Technology Energy vs. Performance}
69\label{perf-energy}
70\end{figure}
71
72
[1774]73In this paper, we generalize parallel bit streams and develop the
[1743]74Parabix programming framework to help programmers build text
[1774]75processing applications. Programmers specify operations on
76unbounded character lists using bit streams in a python environment.
77Our code generation and runtime system translates them into low-level
[1743]78C++ routines.  The Parabix routines exploit the SIMD extensions on
79commodity processors (SSE/AVX on x86, Neon on ARM) to process hundreds
[1774]80of character positions in an input stream simultaneously, and dramatically
[1743]81improving the execution efficiency. We describe the overall Parabix
[1774]82tool chain, a novel execution framework and software build environment
[1743]83that enables text processing applications to effectively exploit
84commodity multicores.
85
[1752]86We study in detail the performance of Parabix technology
87in application to the problem of XML parsing on multiple
88architectures.
[1733]89Figure~\ref{perf-energy} showcases the overall efficiency of our
[1774]90framework and dramatic improvements in both performance and
[1752]91energy efficiency. The Parabix-XML parser exploits
[1774]92the bit stream technology to dramatically reduce branches in
93parsing routines and realize a more efficient pipeline execution. It also
[1743]94substantially improves register utilization which minimizes energy
[1751]95wasted on cache misses and data transfers.\footnote{The actual energy consumption of the XML
[1752]96  ASIC chips is not published by the companies.}
[1743]97
[1752]98We make the following contributions:
[1733]99%
100
[1743]1011) We outline the Parabix architecture, code-generation tool chain and
[1774]102runtime environment; and describe how it may be used to produce
[1743]103efficient XML parser implementations on a variety of commodity
104processors.  While studied in the context of XML parsing, the Parabix
105framework can be widely applied to many problems in text processing
[1752]106and parsing.  We have released Parabix as open source and are
[1743]107interested in exploring the applications that can take advantage of
108our tool chain (\textit{http://parabix.costar.sfu.ca/}).
[1733]109
110
1112) We compare the Parabix XML parser against conventional parsers and
112assess the improvement in overall performance and energy efficiency on
[1743]113variety of hardware platforms.  We use Parabix to study and
[1733]114contrast SSE/AVX extensions across multiple generation of Intel
115processors and show that there are performance challenges when using
116newer generation SIMD extensions. We compare the ARM Neon extensions
117against the x86 SIMD extensions and comment on the latency of SIMD
118operations.
119
[1743]1203) Finally, we multithread the Parabix XML parser to enable the different
[1733]121stages in the parser to exploit SIMD units across all the cores.
122This further improves performance while maintaining the energy consumption
[1743]123comparable with the sequential version.
[1733]124
125
126The remainder of this paper is organized as follows.
127Section~\ref{section:background} presents background material on XML
128parsing and provides insight into the inefficiency of traditional
129parsers.  Section~\ref{section:parabix} describes the Parabix
130architecture, tool chain and runtime environment.
131Section~\ref{section:parser} describes the design of an XML parser
[1774]132based on the Parabix framework. Section \ref{section:methodology} details
133our evaluation framework. Section~\ref{section:baseline}
[1733]134presents a detailed performance analysis of Parabix on a
135\CITHREE\ system using hardware performance counters.
136Section~\ref{section:scalability} compares the performance and energy
137efficiency of 128-bit SIMD extensions across three generations of
138Intel processors and includes a comparison with the ARM Cortex-A8
139processor.  Section~\ref{section:avx} examines the Intel's new 256-bit
140AVX technology and comments on the benefits and challenges compared to
141the 128-bit SSE instructions.  Finally,