# Changeset 1743 for docs/HPCA2012/final_ieee/01-intro.tex

Ignore:
Timestamp:
Nov 30, 2011, 11:30:44 AM (8 years ago)
Message:

First pass final version [ashriram]

File:
1 edited

Unmodified
Removed
• ## docs/HPCA2012/final_ieee/01-intro.tex

 r1733 \section{Introduction} Modern applications ranging from web search to analytics are mainly data centric operating large swathes of data. Information expansion and diversification of data has resulted in multiple textual storage formats. XML is a widely-used text-based data storage format. XML is a standard of the web consortium that provides a common framework for encoding and communicating data.  It is used in applications ranging from Office Open XML in Microsoft Office to NDFD XML of the NOAA National Weather Service, from KML in Google Earth to Castor XML in the Martian Rovers. To enable these diverse applications we need high performance, scalable, and energy efficient text processing stored in these XML documents. As a result of information expansion and diversification of the data format, the demands of high performance and energy efficient text processing are rapidly increasing. However, classical Dennard voltage scaling has reached its limits which gives the traditional byte-at-a-time processing methods little space for further improvement. An alternative is to increase energy efficiency by operating at more optimal core frequencies and achieve better performance with a larger number of cores. Unfortunately, given the limited levels of parallelism that can be found in applications~\cite{blake-isca-2010}, especailly text processing, in which, many applications, for example, XML parsing, are sequential by nature, it is not certain how many cores can be productively used in scaling our chips~\cite{esmaeilzadeh-isca-2011}. In a widely cited Berkeley study~\cite{Asanovic:EECS-2006-183}, the infamous thirteenth dwarf'' (parsers/finite state machines) is considered to be the hardest application class to parallelize. %; in server %workloads the key focus in on overall transactions per second, while %in applications for network switches and cell phones, latency and %energy are of paramount importance. A new technology, Parabix, was introduced to exploit the SIMD extensions on commodity processors to process hundreds of character positions in an input stream simultaneously~\cite{Cameron2008}. Parabix first transposes byte-oriented character data into parallel bit streams using sophisticated SIMD instructions that enable data elements to be packed into registers. With the bit streams, where each bit represents one character from the input data, the text can then be processed in parallel within the SIMD registers. This improves the overall cache behaviour of the application resulting in significantly fewer misses and better utilization.  Parabix also dramatically reduces branches in the parsing routines resulting in a more efficient pipeline and substantially improves register utilization which minimizes energy wasted on data transfers. We apply Parabix technology to the problem of XML parsing.  XML is a standard of the web consortium that provides a common framework for encoding and communicating data.  XML provides critical data storage for applications ranging from Office Open XML in Microsoft Office to NDFD XML of the NOAA National Weather Service, from KML in Google Earth to Castor XML in the Martian Rovers.  XML parsing efficiency is important for multiple application areas; in server workloads the key focus in on overall transactions per second, while in applications for network switches and cell phones, latency and energy are of paramount importance.  Conventional software-based XML parsers have many inefficiencies including considerable branch misprediction penalties due to complex input-dependent branching structures as well as poor use of caches and memory bandwidth due to byte-at-a-time processing.  XML ASIC chips have been around since early 2003, but typically lag behind CPUs in technology due to cost constraints~\cite{xmlchip}. They also focus mainly on speeding up the parser computation itself and are limited by the poor memory behaviour. Our focus is how much we can improve performance of the XML parser on commodity processors with Parabix technology. Unfortunately, given the limited levels of parallelism that can be found in text processing, for example, XML parsing, is inherently sequential, it is not clear how this important class of application can benefit from the growth in multicore processors. As a widely cited Berkeley study~\cite{Asanovic:EECS-2006-183} reports, the thirteenth dwarf'' (parsers/finite state machines) which process text is considered to be the hardest application class to parallelize and process efficienctly.  Conventional software-based text parsers have many inefficiencies including considerable branch misprediction penalties due to complex input-dependent branching structures as well as poor use of caches and memory bandwidth due to byte-at-a-time processing. ASIC chips that process XML textual data have been around since early 2003, but typically lag behind CPUs in technology due to cost constraints~\cite{xmlchip}. They also focus mainly on speeding up the parser computation itself and are limited by the poor memory behaviour. The first generation of Parabix XML parser~\cite{CameronHerdyLin2008}, which applies a sequential bit scan method, has already shown a substantial improvement on performance. The latest version or the second generation of Parabix XML parser~\cite{Cameron2010} introduced a new idea, parallel bit scan, which provides us a more efficient scanning and better utilization of the resources. % % Introduce Parabix. % %However, classical Dennard voltage scaling has reached its limits %which gives the traditional byte-at-a-time processing methods little %space for further improvement. An alternative is to increase energy %efficiency by operating at more optimal core frequencies and achieve %better performance with a larger number of cores. %~\cite{blake-isca-2010}, in scaling our %chips~\cite{esmaeilzadeh-isca-2011} Recently, we developed a novel representation of data~\cite{Cameron2008,CameronLin2009}, Parabix (Parallel bitstreams), to aid parsers and text processing tools. Parabix transposes byte-oriented character data into parallel bit streams, where each bit represents one character from the input data. We explored the use of Parabix representation in UTF-8 to UTF-16 conversion and in specific internal passes of an XML parser~\cite{cameron-EuroPar2011}. %The first generation of Parabix XML parser~\cite{CameronHerdyLin2008}, %which applies a sequential bit scan method, has already shown a %substantial improvement on performance. The latest version or the %second generation of Parabix XML parser~\cite{Cameron2010} introduced %a new idea, parallel bit scan, which provides us a more efficient %scanning and better utilization of the resources. In this paper, We present Parabix tool chain, a novel execution framework and software runtime environment that can be used to dramatically improve the efficiency of text processing and parsing on commodity processors. In this paper, we generalize parallel bitstreams and develop the Parabix programming framework to help programmers build text processing appliations. The programmers specify the perations on unbounded character lists using bitstreams in a python environment, while our code generation and runtime translate them into low-level C++ routines.  The Parabix routines exploit the SIMD extensions on commodity processors (SSE/AVX on x86, Neon on ARM) to process hundreds of character positions in an input stream simultaneously dramatically improving the execution efficiency. We describe the overall Parabix tool chain, a novel execution framework and software build environment that enables text processing applications to effectively exploit commodity multicores. We apply Parabix technology to the problem of XML parsing. Figure~\ref{perf-energy} showcases the overall efficiency of our framework. The Parabix-XML parser improves the performance %by ?$\times$ and energy efficiency %by ?$\times$ several-fold compared to widely-used software parsers, approaching the %?$cycles/input-byte$ performance of ASIC XML parsers~\cite{xmlchip,DaiNiZhu2010}. framework. The Parabix-XML parser improves the performance %by ?$\times$ and energy efficiency %by ?$\times$ several-fold compared to widely-used software parsers, approaching the %?$cycles/input-byte$ performance of ASIC XML parsers~\cite{xmlchip,DaiNiZhu2010}. The Parabix-XML parser exploits the bitstream technology to dramatically reduce branches in the parsing routines resulting in a more efficient pipeline. It also substantially improves register utilization which minimizes energy wasted on cache misses and data transfers. We make the following contributions: \footnote{The actual energy consumption of the XML ASIC chips is not published by the companies.} % Overall we make the following contributions: 1) We outline the Parabix architecture, tool chain and runtime environment and describe how it may be used to produce efficient XML parser implementations on a variety of commodity processors. While studied in the context of XML parsing, the Parabix framework can be widely applied to many problems in text processing and parsing.  We have released Parabix completely open source and are interested in exploring the applications that can take advantage of our tool chain (\textit{http://parabix.costar.sfu.ca/}). 1) We outline the Parabix architecture, code-generation tool chain and runtime environment and describe how it may be used to produce efficient XML parser implementations on a variety of commodity processors.  While studied in the context of XML parsing, the Parabix framework can be widely applied to many problems in text processing and parsing.  We have released Parabix completely open source and are interested in exploring the applications that can take advantage of our tool chain (\textit{http://parabix.costar.sfu.ca/}). 2) We compare the Parabix XML parser against conventional parsers and assess the improvement in overall performance and energy efficiency on variety of hardware platforms.  We are the first to compare and variety of hardware platforms.  We use Parabix to study and contrast SSE/AVX extensions across multiple generation of Intel processors and show that there are performance challenges when using operations. 3) Finally, building on the SIMD parallelism of Parabix technology, we multithread the Parabix XML parser to enable the different 3) Finally, we multithread the Parabix XML parser to enable the different stages in the parser to exploit SIMD units across all the cores. This further improves performance while maintaining the energy consumption constant with the sequential version. comparable with the sequential version. architecture, tool chain and runtime environment. Section~\ref{section:parser} describes the design of an XML parser based on the Parabix framework. Section~\ref{section:methodology} describes the evaluation framework and Section~\ref{section:baseline} based on the Parabix framework. Section~\ref{section:baseline} presents a detailed performance analysis of Parabix on a \CITHREE\ system using hardware performance counters. AVX technology and comments on the benefits and challenges compared to the 128-bit SSE instructions.  Finally, Section~\ref{section:multithread} looks at the multithreading of the Parabix XML parser which seeks to exploit the SIMD units scattered across multiple cores. Section~\ref{section:multithread} looks at multithreading to exploit the SIMD units scattered across multiple cores.
Note: See TracChangeset for help on using the changeset viewer.