\section{Introduction}
Modern applications ranging from web search to analytics are mainly
data-centric, operating over large swaths of information. The expansion
and diversification of data have resulted in multiple textual storage
formats.  Of these, XML is one of the most widely used standards, providing
a common framework for encoding and communicating data.
It is used in applications ranging
from Office Open XML in Microsoft Office to NDFD XML of the NOAA
National Weather Service, from KML in Google Earth to Castor XML in
the Martian Rovers. To enable these diverse applications we need
high-performance, scalable, and energy-efficient processing techniques
for textual data in general, and XML in particular.

%; in server
%workloads the key focus in on overall transactions per second, while
%in applications for network switches and cell phones, latency and
%energy are of paramount importance.


Unfortunately, given the limited levels of parallelism that can be
found in traditional text processing, it is not clear how this important
class of applications
can benefit from the growth in multicore processors. As a widely cited
Berkeley study~\cite{Asanovic:EECS-2006-183} reports, text processing
applications, represented by the ``thirteenth
dwarf'' (parsers/finite state machines), are
considered the hardest application class to parallelize and
process efficiently.  Conventional software-based text parsers suffer from
many inefficiencies, including considerable branch misprediction
penalties due to complex input-dependent branching structures, as well
as poor use of caches and memory bandwidth due to byte-at-a-time
processing. ASIC chips that process XML textual data have been available
since early 2003, but they typically lag behind CPUs in process technology due to
cost constraints~\cite{xmlchip}.
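
To make the byte-at-a-time bottleneck concrete, the following minimal
Python sketch (illustrative only, not drawn from any particular parser)
shows the canonical sequential scanning loop: every input byte is
examined by a data-dependent branch, so throughput is bounded by one
byte per iteration and the branch outcomes are difficult to predict.
\begin{verbatim}
def next_markup(text: bytes, pos: int) -> int:
    # Classic byte-at-a-time scan: advance until the next '<'.
    # One comparison and one data-dependent branch per input byte.
    while pos < len(text) and text[pos] != ord('<'):
        pos += 1
    return pos

doc = b"<doc><item attr='1'/></doc>"
pos = next_markup(doc, 1)   # 5, the start of <item ...>
\end{verbatim}
In contrast, the parallel bit stream approach introduced below locates
every occurrence of \verb|<| in an entire block of input with a handful
of bitwise operations.
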
%They also focus mainly on speeding up
%the computational work alone, without addressing problems of poor memory
%behaviour.

%
% Introduce Parabix.
%

Parallel bit stream (Parabix) technology is a promising new approach
to high-performance text processing. The key insight is the transposition of
byte-oriented character data into parallel bit streams (each with one
bit per input byte), which permits text processing to exploit SIMD
operations on modern processors \cite{Cameron2008, CameronHerdyLin2008}.
Parabix also incorporates a systematic and portable SIMD programming
framework based on our inductive doubling instruction set
architecture model~\cite{CameronLin2009}.  Most recently, a new
parallel scanning primitive \cite{cameron-EuroPar2011} has
been incorporated into the technology base and forms the
basis of the Pablo compiler documented in this paper.
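
As a simple illustration of the transform, the following Python sketch
(conceptual only, with hypothetical helper names; the actual Parabix
runtime performs the transposition and the bitwise logic with SIMD
instructions over fixed-size blocks) builds the eight basis bit streams
and derives a character-class stream from them:
\begin{verbatim}
def transpose(data):
    # streams[i] has bit p set iff bit i of input byte p is set.
    streams = [0] * 8
    for p, byte in enumerate(data):
        for i in range(8):
            if (byte >> i) & 1:
                streams[i] |= 1 << p
    return streams

def match_byte(streams, value, length):
    # Character-class stream: bit p is set iff input byte p == value.
    # Formed purely by bitwise logic over the eight basis streams.
    mask = (1 << length) - 1
    result = mask
    for i in range(8):
        basis = streams[i] if (value >> i) & 1 else ~streams[i]
        result &= basis & mask
    return result

text = b"<doc><item/></doc>"
left_angles = match_byte(transpose(text), ord('<'), len(text))
# left_angles has 1 bits exactly at the '<' positions (0, 5 and 12).
\end{verbatim}
Once the basis streams are formed, any byte class can be recognized at
every position simultaneously with at most eight bitwise operations,
which is what allows scanning to proceed many positions at a time.
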
%The first generation of Parabix XML parser~\cite{CameronHerdyLin2008},
%which applies a sequential bit scan method, has already shown a
%substantial improvement on performance. The latest version or the
%second generation of Parabix XML parser~\cite{Cameron2010} introduced
%a new idea, parallel bit scan, which provides us a more efficient
%scanning and better utilization of the resources.

\begin{figure}
\begin{center}
\includegraphics[width=85mm]{plots/performance_energy_chart.pdf}
\end{center}
\caption{XML Parser Technology: Energy vs.\ Performance}
\label{perf-energy}
\end{figure}

In this paper, we generalize parallel bit streams and develop the
Parabix programming framework to help programmers build text
processing applications. Programmers specify operations on
unbounded character lists using bit streams in a Python environment.
Our code generation and runtime system translates these specifications
into low-level C++ routines.  The Parabix routines exploit the SIMD
extensions of commodity processors (SSE/AVX on x86, Neon on ARM) to
process hundreds of character positions in an input stream
simultaneously, dramatically improving execution efficiency. We describe
the overall Parabix tool chain, a novel execution framework and software
build environment that enables text processing applications to
effectively exploit commodity multicores.
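
As a flavour of the programming model, the sketch below shows the style
of bit stream equation a programmer might write (the names are
illustrative assumptions, not the exact Parabix API). A single addition
carries each marker bit through a run of class characters, so an entire
scan loop collapses into three operations; the generated C++ applies the
same logic block by block with SIMD registers and carry propagation.
\begin{verbatim}
def scan_thru(markers, charclass):
    # Advance every marker bit through the contiguous run of 1 bits
    # in charclass that begins at its position, all in parallel.
    return (markers + charclass) & ~charclass

text        = b"<name attr='x'/>"
name_chars  = sum(1 << p for p, b in enumerate(text)
                  if chr(b).isalpha())
tag_opens   = sum(1 << p for p, b in enumerate(text)
                  if b == ord('<'))
name_starts = tag_opens << 1            # position after each '<'
name_ends   = scan_thru(name_starts, name_chars)
# name_ends has a 1 just past each tag name (here, bit 5: the space).
\end{verbatim}
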

We study in detail the performance of Parabix technology
as applied to the problem of XML parsing on multiple
architectures.
Figure~\ref{perf-energy} showcases the overall efficiency of our
framework and the dramatic improvements in both performance and
energy efficiency. The Parabix XML parser exploits
bit stream technology to reduce branches in the
parsing routines and realize more efficient pipeline execution. It also
substantially improves register utilization, which minimizes energy
wasted on cache misses and data transfers.\footnote{The actual energy consumption of the XML
  ASIC chips is not published by the companies.}

We make the following contributions:
%

1) We outline the Parabix architecture, code-generation tool chain and
runtime environment, and describe how they may be used to produce
efficient XML parser implementations on a variety of commodity
processors.  While studied in the context of XML parsing, the Parabix
framework can be applied to many problems in text processing
and parsing.  We have released Parabix as open source and are
interested in exploring the applications that can take advantage of
our tool chain (\textit{http://parabix.costar.sfu.ca/}).

2) We compare the Parabix XML parser against conventional parsers and
assess the improvement in overall performance and energy efficiency on a
variety of hardware platforms.  We use Parabix to study and
contrast the SSE/AVX extensions across multiple generations of Intel
processors and show that there are performance challenges when using
newer-generation SIMD extensions. We compare the ARM Neon extensions
against the x86 SIMD extensions and comment on the latency of SIMD
operations.

3) Finally, we multithread the Parabix XML parser to enable the different
stages in the parser to exploit the SIMD units across all the cores.
This further improves performance while keeping energy consumption
comparable to that of the sequential version.

The remainder of this paper is organized as follows.
Section~\ref{section:background} presents background material on XML
parsing and provides insight into the inefficiency of traditional
parsers.  Section~\ref{section:parabix} describes the Parabix
architecture, tool chain and runtime environment.
Section~\ref{section:parser} describes the design of an XML parser
based on the Parabix framework. Section~\ref{section:methodology} details
our evaluation framework. Section~\ref{section:baseline}
presents a detailed performance analysis of Parabix on a
\CITHREE\ system using hardware performance counters.
Section~\ref{section:scalability} compares the performance and energy
efficiency of 128-bit SIMD extensions across three generations of
Intel processors and includes a comparison with the ARM Cortex-A8
processor.  Section~\ref{section:avx} examines Intel's new 256-bit
AVX technology and comments on its benefits and challenges compared to
the 128-bit SSE instructions.  Finally,
Section~\ref{section:multithread} looks at multithreading to exploit
the SIMD units scattered across multiple cores.