# source:docs/HPCA2012/final_ieee/01-intro.tex@1784

Last change on this file since 1784 was 1784, checked in by cameron, 8 years ago

Eliminate duplicate refs; Fluke ref.

File size: 7.0 KB
Line
1\section{Introduction}
2Modern applications ranging from web search to analytics are mainly
3data centric operating over large swaths of information. Information expansion
4and diversification of data has resulted in multiple textual storage
5formats.  Of these, XML is one of the most widely used standards, providing
6a common framework for encoding and communicating data.
7It is used in applications ranging
8from Office Open XML in Microsoft Office to NDFD XML of the NOAA
9National Weather Service, from KML in Google Earth to Castor XML in
10the Martian Rovers. To enable these diverse applications we need high
11performance, scalable, and energy efficient processing techniques
12for textual data in general, and XML, in particular.
13
14%; in server
15%workloads the key focus in on overall transactions per second, while
16%in applications for network switches and cell phones, latency and
17%energy are of paramount importance.
18
19
20Unfortunately, given the limited levels of parallelism that can be
21found in traditional text processing, it is not clear how this important
22class of application
23can benefit from the growth in multicore processors. As a widely cited
24Berkeley study~\cite{Asanovic:EECS-2006-183} reports, text processing
25applications represented by the thirteenth
26dwarf'' (parsers/finite state machines) are
27considered to be the hardest application class to parallelize and
28process efficiently.  Conventional software-based text parsers have
29many inefficiencies including considerable branch misprediction
30penalties due to complex input-dependent branching structures as well
31as poor use of caches and memory bandwidth due to byte-at-a-time
32processing. ASIC chips that process XML textual data have been around
33since early 2003, but typically lag behind CPUs in technology due to
34cost constraints~\cite{Leventhal2009}
35%They also focus mainly on speeding up
36%the computational work alone, without addressing problems of poor memory
37%behaviour.
38
39%
40% Introduce Parabix.
41%
42
43Parallel bit stream (Parabix) technology is a promising new approach
44for high performance text processing. The key insight is based on the transposition of
45byte-oriented character data into parallel bit streams (each with one
46bit per input byte) which permits text processing to exploit SIMD
47operations on modern processors \cite{Cameron2008, CameronHerdyLin2008}.
48Parabix also incorporates a systematic and portable SIMD programming
49framework based on our model inductive doubling instruction set
50architecture~\cite{CameronLin2009}.    Most recently, a new
51parallel scanning primitive \cite{cameron-EuroPar2011} has
52been incorporated into the technology base to form the
53basis of the Pablo compiler documented in this paper.
54
55
56%The first generation of Parabix XML parser~\cite{CameronHerdyLin2008},
57%which applies a sequential bit scan method, has already shown a
59%second generation of Parabix XML parser~\cite{Cameron2010} introduced
60%a new idea, parallel bit scan, which provides us a more efficient
61%scanning and better utilization of the resources.
62
63
64
65%\pagebreak
66In this paper, we generalize parallel bit streams and develop the
67Parabix programming framework to help programmers build text
68processing applications. Programmers specify operations on
69unbounded character lists using bit streams in a python environment.
70Our code generation and runtime system translates them into low-level
71C++ routines.  The Parabix routines exploit the SIMD extensions on
72commodity processors (SSE/AVX on x86, Neon on ARM) to process hundreds
73of character positions in an input stream simultaneously, and dramatically
74improving the execution efficiency. We describe the overall Parabix
75tool chain, a novel execution framework and software build environment
76that enables text processing applications to effectively exploit
77commodity multicores.
78
79\pagebreak
80We study in detail the performance of Parabix technology
81in application to the problem of XML parsing on multiple
82architectures.
83Figure~\ref{perf-energy} showcases the overall efficiency of our
84framework and dramatic improvements in both performance and
85energy efficiency. The Parabix-XML parser exploits
86the bit stream technology to dramatically reduce branches in
87parsing routines and realize a more efficient pipeline execution. It also
88substantially improves register utilization which minimizes energy
89wasted on cache misses and data transfers.\footnote{The actual energy consumption of the XML
91
92
93\begin{figure}[!h]
94\begin{center}
95\includegraphics[trim = 2mm 1mm 1mm 2mm, clip, width=85mm]{plots/performance_energy_chart.pdf}
96\end{center}
97\caption{XML Parser Technology Energy vs. Performance}
98\label{perf-energy}
99\end{figure}
100
101We make the following contributions:
102%
103
1041) We outline the Parabix architecture, code-generation tool chain and
105runtime environment; and describe how it may be used to produce
106efficient XML parser implementations on a variety of commodity
107processors.  While studied in the context of XML parsing, the Parabix
108framework can be widely applied to many problems in text processing
109and parsing.  We have released Parabix as open source and are
110interested in exploring the applications that can take advantage of
111our tool chain (\textit{http://parabix.costar.sfu.ca/}).
112
113
1142) We compare the Parabix XML parser against conventional parsers and
115assess the improvement in overall performance and energy efficiency on
116variety of hardware platforms.  We use Parabix to study and
117contrast SSE/AVX extensions across multiple generation of Intel
118processors and show that there are performance challenges when using
119newer generation SIMD extensions. We compare the ARM Neon extensions
120against the x86 SIMD extensions and comment on the latency of SIMD
121operations.
122
1233) Finally, we multithread the Parabix XML parser to enable the different
124stages in the parser to exploit SIMD units across all the cores.
125This further improves performance while maintaining the energy consumption
126comparable with the sequential version.
127
128
129The remainder of this paper is organized as follows.
130Section~\ref{section:background} presents background material on XML
131parsing and provides insight into the inefficiency of traditional
132parsers.  Section~\ref{section:parabix} describes the Parabix
133architecture, tool chain and runtime environment.
134Section~\ref{section:parser} describes the design of an XML parser
135based on the Parabix framework. Section \ref{section:methodology} details
136our evaluation framework. Section~\ref{section:baseline}
137presents a detailed performance analysis of Parabix on a
138\CITHREE\ system using hardware performance counters.
139Section~\ref{section:scalability} compares the performance and energy
140efficiency of 128-bit SIMD extensions across three generations of
141Intel processors and includes a comparison with the ARM Cortex-A8
142processor.  Section~\ref{section:avx} examines the Intel's new 256-bit
143AVX technology and comments on the benefits and challenges compared to
144the 128-bit SSE instructions.  Finally,