# Changeset 1691 for docs/HPCA2012/01-intro.tex

Timestamp:
Nov 17, 2011, 2:52:12 PM
Message:

Change Introduction

File:
1 edited

 r1652 \section{Introduction} Classical Dennard voltage scaling enabled us to keep all of the transistors afforded by Moore's law active. Dennard scaling reached its limits in 2005, and this has forced a rethink of the way general-purpose processors are built: frequencies have remained stagnant over the last five years, with a core's frequency boosted only if other cores on the chip are shut off. Chip makers now strive for energy-efficient computing by operating at more optimal core frequencies and aim to increase performance with a larger number of cores. Unfortunately, given the limited levels of parallelism that can be found in applications~\cite{blake-isca-2010}, it is not certain how many cores can be productively used in scaling our chips~\cite{esmaeilzadeh-isca-2011}. This is because exploiting parallelism across multiple cores tends to require heavyweight threads that are difficult to manage and synchronize. The desire to improve the overall efficiency of computing is pushing designers to explore customized hardware~\cite{venkatesh-asplos-2010, hameed-isca-2010} that accelerates specific parts of an application while reducing the overheads present in general-purpose processors. They seek to exploit the transistor bounty to provision many different accelerators, keeping only the accelerators an application needs active while switching off the others to reduce power consumption. While promising, given the fast evolution of languages and software, it is hard to define a set of fixed-function hardware for commodity processors. Furthermore, the toolchain needed to create such customized hardware is itself a hard research challenge. We believe that software, applications, and runtime models themselves can be refactored to significantly improve the overall computing efficiency of commodity processors. 
As a result of the expansion of information and the diversification of data formats, the demand for high-performance, energy-efficient text processing is rapidly increasing. However, classical Dennard voltage scaling has reached its limits, leaving traditional byte-at-a-time processing methods little room for further improvement. An alternative is to increase energy efficiency by operating at more optimal core frequencies and to achieve better performance with a larger number of cores. Unfortunately, given the limited levels of parallelism that can be found in applications~\cite{blake-isca-2010}, especially in text processing, where many applications, such as XML parsing, are sequential by nature, it is not certain how many cores can be productively used in scaling our chips~\cite{esmaeilzadeh-isca-2011}. In this paper we tackle the infamous ``thirteenth dwarf'' (parsers/finite state machines), which a widely cited Berkeley study considers the hardest application class to parallelize~\cite{Asanovic:EECS-2006-183}. We present Parabix, a novel execution framework and software runtime environment that can be used to dramatically improve the efficiency of text processing and parsing on commodity processors. Parabix transposes byte-oriented character data into parallel bit streams and then exploits the SIMD extensions on commodity processors (SSE/AVX on x86, Neon on ARM) to process hundreds of character positions in an input stream simultaneously. To transform character-oriented data into bit streams, Parabix exploits sophisticated SIMD instructions that enable data elements to be packed into registers. 
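The transposition step described above can be sketched in portable scalar C. This is an illustrative model only, not Parabix's actual SIMD implementation (which performs the transposition with packing instructions in a logarithmic number of steps); the function name and 64-byte block size are our own assumptions for the sketch.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative scalar sketch (not the Parabix SIMD code): transpose a
 * block of up to 64 input bytes into 8 parallel bit streams.  Bit i of
 * stream k is bit k of input byte i, so each stream collects one bit
 * position across all character positions in the block. */
static void transpose_to_bit_streams(const uint8_t *bytes, size_t n,
                                     uint64_t streams[8]) {
    memset(streams, 0, 8 * sizeof(uint64_t));
    for (size_t i = 0; i < n && i < 64; i++) {
        for (int k = 0; k < 8; k++) {
            if (bytes[i] & (1u << k))
                streams[k] |= (uint64_t)1 << i;
        }
    }
}
```

With the data in this form, a single bitwise operation over a stream word touches 64 character positions at once; widening the word to a 128-bit SSE or 256-bit AVX register is what lets the real implementation process hundreds of positions per instruction.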
A new technology, Parabix, was introduced to exploit the SIMD extensions on commodity processors to process hundreds of character positions in an input stream simultaneously~\cite{Cameron2008}. Parabix first transposes byte-oriented character data into parallel bit streams using sophisticated SIMD instructions that enable data elements to be packed into registers. With these bit streams, where each bit represents one character position in the input data, the text can then be processed in parallel within the SIMD registers. This improves the overall cache behaviour of the application, resulting in significantly fewer misses and better utilization. Parabix also dramatically reduces branches in the parsing routines, resulting in a more efficient XML parser on commodity processors. The first-generation Parabix XML parser~\cite{CameronHerdyLin2008}, which applies a sequential bit-scan method, has already shown a substantial performance improvement. The second-generation Parabix XML parser~\cite{Cameron2010} introduced parallel bit scanning, which provides more efficient scanning and better utilization of resources. In this paper, we present the Parabix tool chain, a novel execution framework and software runtime environment that can be used to dramatically improve the efficiency of text processing and parsing on commodity processors. Figure~\ref{perf-energy} showcases the overall efficiency of our framework. The Parabix-XML parser improves the
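The branch-reduction claim above can be made concrete with a small sketch. Assuming the basis bit streams produced by transposition, a marker stream flagging every `<` character is computed with pure bitwise logic (no per-character branches), and the sequential bit-scan method iterates its set bits with a count-trailing-zeros primitive. The function names are ours, not the actual Parabix API, and `__builtin_ctzll` assumes a GCC/Clang toolchain.

```c
#include <stdint.h>

/* Illustrative sketch (our naming, not the Parabix API): given the 8
 * basis bit streams for a 64-byte block, compute a marker stream with a
 * 1 at every position holding '<' (0x3C = 0b00111100), using only
 * bitwise logic -- no branch per character. */
static uint64_t match_lt(const uint64_t b[8]) {
    return ~b[7] & ~b[6] & b[5] & b[4] & b[3] & b[2] & ~b[1] & ~b[0];
}

/* Sequential bit scan: return the next marker position, or -1 when the
 * stream is exhausted, clearing each marker as it is consumed. */
static int next_marker(uint64_t *markers) {
    if (*markers == 0) return -1;
    int pos = __builtin_ctzll(*markers);
    *markers &= *markers - 1;   /* clear the lowest set bit */
    return pos;
}
```

Because the marker computation is branch-free and the scan loop iterates only over actual matches rather than over every input byte, the parser's control flow depends on the density of markers, not on the length of the text.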