# Changeset 1330 for docs/HPCA2012

Timestamp:
Aug 20, 2011, 5:12:13 PM
Message:

Revise intro

File:
1 edited

\section{Introduction}

We have now long since reached the limit of classical Dennard voltage scaling~\cite{}, which enabled us to keep all of the transistors afforded by Moore's law active. This has already resulted in a rethink of the way general-purpose processors are built: processor frequencies have remained stagnant over the last five years, and Intel multicores can boost core speeds only if other cores on the chip are shut off. Chip makers strive to achieve energy-efficient computing by operating at more optimal core frequencies while aiming to increase performance with a larger number of cores. Unfortunately, given the limited levels of parallelism that can be found in applications~\cite{blake-isca-2010}, it is not certain how many cores can be productively used in scaling our chips~\cite{esmaeilzadeh-isca-2011}. This is because exploiting parallelism across multiple cores tends to require heavyweight threads that are difficult to manage and synchronize. The desire to improve the overall efficiency of computing is also pushing designers to exploit the transistor bounty to provision many different accelerators, keeping only the accelerators needed for an application active while switching off the others on the chip to save power.
While promising, given the fast evolution of languages and software, it is hard to define a fixed set of function-specific accelerators; instead, we seek to improve the computing efficiency of commodity processors themselves. In this paper, we tackle the infamous ``thirteenth dwarf'' (parsers/finite state machines), considered to be the hardest application class to parallelize~\cite{Asanovic:EECS-2006-183}, and show how Parabix, a novel software architecture, tool chain and run-time environment, can indeed be used to dramatically improve parsing efficiency on commodity processors.
Based on the concept of transposing byte-oriented character data into parallel bit streams, one for each of the bit positions within a byte, the Parabix framework exploits the SIMD extensions (SSE/AVX on x86, Neon on ARM) of commodity processors to process hundreds of character positions in an input stream simultaneously. To achieve transposition, Parabix uses sophisticated SIMD instructions that pack and unpack data elements from registers in a regular manner, which improves the overall cache access behavior of the application, resulting in significantly fewer misses and better utilization. Parabix also dramatically reduces branches in parsing code, resulting in a more efficient pipeline, and substantially improves register/cache utilization, which minimizes energy wasted on data transfers. We study Parabix technology in application to the problem of XML parsing and develop several implementations for different computing platforms. In the end, as summarized by Figure~\ref{perf-energy}, our Parabix-based XML parser improves performance by ?$\times$ and energy efficiency by ?$\times$ compared to widely-used software parsers, approaching the ?~$cycles/input\mbox{-}byte$ performance of ASIC XML parsers~\cite{}.\footnote{The actual energy consumption of the XML ASIC chips is not published by the companies.}
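As a concrete illustration of the bit-stream model, the two ideas above, transposition and branch-free stream computation, can be sketched in a few lines. This is a simplified model, not the actual Parabix code: plain Python integers stand in for SIMD registers (bit $p$ of each stream corresponds to input position $p$), and the function names are ours.

```python
def transpose(data: bytes):
    """Transpose byte-oriented data into 8 parallel bit streams.

    Stream i holds, at position p, bit i (most significant bit first)
    of input byte p.  Python ints model arbitrarily wide registers.
    """
    streams = [0] * 8
    for pos, byte in enumerate(data):
        for i in range(8):
            if byte & (0x80 >> i):
                streams[i] |= 1 << pos
    return streams


def mark_open_angle(streams, n):
    """Mark every '<' (0x3C = 00111100) in an n-byte input.

    One bitwise formula is evaluated over all n positions at once,
    with no per-character branch -- mirroring how Parabix evaluates
    character classes across a whole SIMD register per step.
    """
    mask = (1 << n) - 1                      # confine ~ to n positions
    b = streams                              # b[0] = MSB of each byte
    return (~b[0] & ~b[1] & b[2] & b[3] & b[4] & b[5] & ~b[6] & ~b[7]) & mask
```

For the input `b"a<b<"` the marker stream is `0b1010`: bits 1 and 3 flag the two `<` characters. The same formula covers 128 or 256 positions per step when evaluated on SSE or AVX registers.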
Overall, we make the following contributions in this paper.
1) We introduce the Parabix architecture, tool chain and run-time environment and describe how it may be used to produce efficient XML parser implementations on a variety of commodity processors. While studied here in the context of XML parsing, the Parabix framework can be widely applied to many problems in text processing and parsing.
2) We compare our Parabix XML parsers against conventional parsers and assess the improvement in overall performance and energy efficiency on each platform. We are the first to compare and contrast the SSE/AVX extensions across multiple generations of Intel processors, showing that there are performance challenges when using newer-generation SIMD extensions, possibly due to their memory interface. We also compare the ARM Neon extensions against the x86 SIMD extensions and comment on the latency of SIMD operations across these architectures.
3) Finally, building on the SIMD parallelism of Parabix technology, we multithread the Parabix XML parser to enable the different stages of the parser to exploit SIMD units across all the cores. This further improves performance while keeping energy consumption on par with the sequential version.
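The stage-level multithreading mentioned in the last contribution can be sketched as a thread-per-stage pipeline. This is an illustrative model under our own assumptions, not the Parabix implementation: each stage thread pulls an input chunk from its queue, applies its stage function, and hands the result to the next stage, so different chunks occupy different stages concurrently.

```python
import queue
import threading


def run_pipeline(stages, inputs):
    """Run a list of stage functions as a thread-per-stage pipeline.

    `stages` is a list of one-argument functions; `inputs` is a list of
    chunks.  A None "poison pill" propagates shutdown down the pipeline.
    FIFO queues with one thread per stage preserve chunk order.
    """
    queues = [queue.Queue() for _ in range(len(stages) + 1)]

    def worker(fn, q_in, q_out):
        while True:
            item = q_in.get()
            if item is None:          # shut down and tell the next stage
                q_out.put(None)
                break
            q_out.put(fn(item))

    threads = [
        threading.Thread(target=worker, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for chunk in inputs:              # feed all chunks, then the pill
        queues[0].put(chunk)
    queues[0].put(None)
    results = []
    while (out := queues[-1].get()) is not None:
        results.append(out)
    for t in threads:
        t.join()
    return results
```

In a parser organized this way, a transposition stage and a parsing stage, for example, can each run SIMD code on its own core while chunks stream between them.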