# Changeset 1393 for docs/HPCA2012/01-intro.tex

Timestamp:
Aug 30, 2011, 10:47:59 AM
Message:

Minor bug fixes up to 04

File:
1 edited

\section{Introduction} We have now long since reached the limits of the classical Dennard voltage scaling that enabled us to keep all of the transistors afforded by Moore's law active. This has already resulted in a rethink of the way general-purpose processors are built: processor frequencies have remained stagnant over the last 5 years, with the capability to boost a core's frequency only if other cores on the chip are shut off. Chip makers strive to achieve energy-efficient computing by operating at more optimal core frequencies and aim to increase performance with a larger number of cores. Unfortunately, given the limited levels of parallelism that can be found in applications~\cite{blake-isca-2010}, it is not certain how many cores can be productively used in scaling our chips~\cite{esmaeilzadeh-isca-2011}. This is because exploiting parallelism across multiple cores tends to require heavyweight threads that are difficult to manage and synchronize.
In this paper, we tackle the infamous ``thirteenth dwarf'' (parsers/finite state machines), considered to be the hardest application class to parallelize~\cite{Asanovic:EECS-2006-183}. We present Parabix, a novel execution framework and software run-time environment that can be used to dramatically improve the efficiency of text processing and parsing on commodity processors. Parabix transposes byte-oriented character data into parallel bit streams, one for each of the individual bits of each character byte, and then exploits the SIMD extensions of commodity processors (SSE/AVX on x86, Neon on ARM) to process hundreds of character positions in an input stream simultaneously. XML processing is ubiquitous, with applications ranging from Office Open XML in Microsoft Office to NDFD XML of the NOAA National Weather Service, from KML in Google Earth to Castor XML in the Martian Rovers, as well as XML data in Android phones. XML parsing efficiency is important for multiple application areas: in server workloads the key focus is on overall transactions per second, while in network switches and cell phones, latency and energy are of paramount importance.
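As a concrete illustration, the transposition step can be sketched in plain C. This is a scalar model for exposition only, with hypothetical helper names; the actual Parabix run-time performs the transposition with SIMD pack operations on whole registers at a time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Transpose up to 8 input bytes into 8 basis bit streams: bit k of
   input byte i becomes bit i of basis stream k.  (Scalar model; the
   real framework transposes entire SIMD registers at once.) */
static void transpose(const uint8_t *bytes, size_t n, uint8_t basis[8]) {
    memset(basis, 0, 8);
    for (size_t i = 0; i < n && i < 8; i++)
        for (int k = 0; k < 8; k++)
            if (bytes[i] & (1u << k))
                basis[k] |= (uint8_t)(1u << i);
}

/* With basis streams in hand, a character class reduces to bitwise
   logic: '<' is 0x3C (bits 2-5 set, bits 0, 1, 6, 7 clear), so a
   mask of all '<' positions is computed with no per-byte branching. */
static uint8_t match_lt(const uint8_t b[8]) {
    return (uint8_t)(b[5] & b[4] & b[3] & b[2] &
                     ~(b[7] | b[6] | b[1] | b[0]));
}
```

In a 128-bit SSE register the same handful of AND/OR/NOT operations classifies 128 character positions at once, which is the source of the parallelism the paper exploits.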
Traditional software-based XML parsers have many inefficiencies, including considerable branch misprediction penalties due to complex input-dependent branching structures, as well as poor use of memory bandwidth and data caches due to byte-at-a-time processing and multiple buffering. XML ASIC chips have been available for over 6 years, but typically lag behind CPUs in technology due to cost constraints. Our focus is on how much Parabix technology can improve the performance of XML parsing on commodity processors.
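The branch-misprediction problem can be seen in miniature in the inner loop of a conventional byte-at-a-time scanner. The sketch below is a simplified, hypothetical example, not code from any particular parser:

```c
#include <stddef.h>

/* Advance until the next XML-significant byte.  Every iteration makes
   a branch decision that depends on the input data itself, so on
   irregular markup the hardware branch predictor mispredicts often. */
static size_t scan_to_marker(const char *buf, size_t n, size_t pos) {
    while (pos < n && buf[pos] != '<' && buf[pos] != '&')
        pos++;
    return pos;
}
```

Parabix instead precomputes a bit stream marking all significant positions and locates the next one with a single bit-scan operation, replacing the data-dependent branch with straight-line code.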
This paper makes the following contributions. We compare Parabix XML parsers against conventional parsers and assess the improvement in overall performance and energy efficiency on each platform. We are the first to compare and contrast the SSE/AVX extensions across multiple generations of Intel processors and show that there are performance challenges when using newer-generation SIMD extensions, possibly due to their memory interface. We compare the ARM Neon extensions against the x86 SIMD extensions and comment on the latency of SIMD operations across these architectures. Finally, building on the SIMD parallelism of Parabix technology, we multithread the Parabix XML parser using pipeline parallelism.

The remainder of this paper is organized as follows. Section~\ref{section:background} presents background material on XML parsing and provides insight into the inefficiency of traditional parsers on mainstream processors. Section~\ref{section:parabix} describes the Parabix architecture, tool chain and run-time environment. Section~\ref{section:parser} describes the application of the Parabix framework to the construction of an XML parser enforcing all the well-formedness rules of the XML specification. Section~\ref{section:baseline} presents a detailed performance analysis of Parabix on a \CITHREE\ system using hardware performance counters and compares it against conventional parsers. Section~\ref{section:scalability} compares the performance and energy efficiency of 128-bit SIMD extensions across three generations of Intel processors and includes a comparison with the ARM Cortex-A8 processor. Section~\ref{section:avx} examines Intel's new 256-bit AVX technology and comments on its benefits and challenges compared to the 128-bit SSE instructions. Section~\ref{section:multithread} looks at the multithreading of the Parabix XML parser, which seeks to exploit the SIMD units scattered across multiple cores. Section~\ref{section:related} discusses related work, after which Section~\ref{section:conclusion} concludes the paper.