Changeset 1326 for docs/HPCA2011/01-intro.tex

Timestamp: Aug 19, 2011, 4:57:57 PM
Message: New Intro, New title
File: docs/HPCA2011/01-intro.tex (1 edited)
\section{Introduction}
Extensible Markup Language (XML) is a core technology standard of the World Wide Web Consortium (W3C) that provides a common framework for encoding and communicating structured information. In applications ranging from Office Open XML in Microsoft Office to NDFD XML of the NOAA National Weather Service, from KML in Google Earth to Castor XML in the Martian Rovers, and from ebXML for e-commerce data interchange to RSS for news feeds from web sites everywhere, XML plays a ubiquitous role in providing a common framework for data interoperability world-wide and beyond. As XML 1.0 editor Tim Bray is quoted in the W3C celebration of XML at 10 years, ``there is essentially no computer in the world, desk-top, hand-held, or back-room, that doesn't process XML sometimes.''

Classical Dennard scaling~\cite{}, which ensured that voltage scaling would let us keep all of the transistors afforded by Moore's law active, has now ended. This has already prompted a rethink of the way general-purpose processors are built: processor frequencies have remained stagnant over the last five years, and Intel multicores provide the capability to boost core speeds when other cores on the chip are shut off. Chip makers strive for energy-efficient computing by operating cores at more optimal frequencies and aiming to increase performance with a larger number of cores. Unfortunately, given the levels of parallelism~\cite{blake-isca-2010} that multicores can actually exploit in applications, it is not certain how far we can continue scaling the number of cores on our chips~\cite{esmaeilzadeh-isca-2011}, because exploiting parallelism across multiple cores tends to require heavyweight threads that are difficult to manage and synchronize.
The desire to improve the overall efficiency of computing is pushing designers to explore customized hardware~\cite{venkatesh-asplos-2010, hameed-isca-2010} that accelerates specific parts of an application while reducing the overheads present in general-purpose processors. They seek to exploit the transistor bounty to provision many different accelerators, keeping only the accelerators needed by an application active while switching off the others to save power. While promising, given the fast evolution of languages and software, it is hard to define a set of fixed-function hardware for commodity processors. Furthermore, the toolchain needed to create such customized hardware is itself a hard research challenge. We believe that software, applications, and runtime models themselves can be refactored to significantly improve the overall computing efficiency of commodity processors. In this paper, we demonstrate with an XML parser that changes to the underlying algorithm and compute model can significantly improve efficiency on commodity processors. We achieve this efficiency by carefully redesigning the algorithm around the Parallel Bitstream runtime framework (Parabix), which exploits the SIMD extensions (SSE/AVX on x86, Neon on ARM) of commodity processors. Parabix uses modern instructions in the processor ISA that can execute tens of operations (on multiple character streams) in a single instruction, amortizing the overheads of the general-purpose processor. Parabix also minimizes or entirely eliminates branches, resulting in a more efficient pipeline, and improves overall register and cache utilization, which minimizes the energy wasted on data transfers. The SSE/AVX instructions Parabix exploits also include sophisticated operations for packing and unpacking data elements in registers, which make the application's overall cache access behavior regular, resulting in significantly fewer misses and better utilization.
Overall, as summarized in Figure~\ref{perf-energy}, our Parabix-based XML parser improves performance by ?$\times$ and energy efficiency by ?$\times$ compared to widely used software parsers, approaching the ? cycles/input-byte performance of ASIC XML parsers~\cite{}.\footnote{The actual energy consumption of the XML ASIC chips is not published by the companies.} XML is a particularly interesting application: as noted above, it provides a common framework for encoding and communicating data, and it provides critical data storage for applications ranging from office documents to XML data in Android phones. XML parsing efficiency matters in multiple application areas: in server workloads the key focus is on overall transactions per second, while in network switches and cell phones the latency and energy cost of parsing are of paramount importance. Software-based XML parsers are particularly inefficient; they consist of giant \textit{switch-case} statements that waste processor resources by introducing input-data-dependent branches, and they have poor cache efficiency because they sift forward and backward through the input stream trying to match the parsed tags. XML ASIC chips have been around for over six years, but they typically lag behind CPUs in technology due to cost constraints. Our focus is on how much we can improve the performance of XML parsing on commodity processors with Parabix technology. Overall, we make the following contributions in this paper. 1) We develop an XML parser that demonstrates the impact of redesigning the core of an application to make more efficient use of commodity processors. We compare the Parabix-XML parser against conventional parsers and demonstrate the improvement in overall performance and energy efficiency.
We also parallelize the Parabix-XML parser so that the different stages of the parser exploit the SIMD units across all the cores. This further improves performance while keeping energy consumption on par with the sequential version. 2) We are the first to compare and contrast the SSE/AVX extensions across multiple generations of Intel processors and show that there are performance challenges when using newer-generation SIMD extensions, possibly due to their memory interface. We also compare ARM's Neon against x86's SIMD extensions and comment on the latency of SIMD operations across these architectures. 3) Finally, we introduce a runtime framework, \textit{Parabix}, that abstracts the SIMD specifics of the machine (e.g., register widths) and provides a language framework that enables applications to run efficiently on commodity processors. Parabix enables general-purpose multicores to be used efficiently by an entirely new class of applications: text processing and parsing. \begin{comment} Figure~\ref{perf-energy} is an energy-performance scatter plot showing the results obtained. With all this XML processing, a substantial literature has arisen addressing XML processing performance in general and the performance of XML parsers in particular. Nicola and John specifically identified XML parsing as a threat to database performance and outlined a number of potential directions for performance improvements \cite{NicolaJohn03}. The nature of XML APIs was found to have a significant effect on performance, with event-based SAX (Simple API for XML) parsers avoiding the tree construction costs of the more flexible DOM (Document Object Model) parsers \cite{Perkins05}. The commercial importance of XML parsing spurred developments of hardware-based approaches including the development of a custom XML chip \cite{Leventhal2009} as well as FPGA-based implementations \cite{DaiNiZhu2010}.
However promising these approaches may be for particular niche applications, it is likely that the bulk of the world's XML processing workload will be carried out on commodity processors using software-based solutions. To accelerate XML parsing performance in software, most recent work seeks benefits over traditional sequential parsing techniques that follow the byte-at-a-time model. With this focus on performance, however, relatively little attention has been paid to reducing energy consumption in XML processing. For example, in addressing performance through multicore parallelism, one generally must pay an energy price for performance gains because of the increased processing required for synchronization. This reduction of energy consumption is a key topic in this paper. We study the energy and performance characteristics of several XML parsers across three generations of x86-64 processor technology.
The parsers we consider are the widely used byte-at-a-time parsers Expat and Xerces, as well as the Parabix1 and Parabix2 parsers based on parallel bit stream technology. A compelling result is that the performance benefits of parallel bit stream technology translate directly and proportionally into substantial energy savings. Figure~\ref{perf-energy} is an energy-performance scatter plot showing the results obtained. \end{comment}

The remainder of this paper is organized as follows. Section 2 presents background material on XML parsing and traditional parsing methods. Section 3 reviews parallel bit stream technology as applied to XML parsing in the Parabix1 and Parabix2 parsers. Section 4 introduces our methodology and approach for the performance and energy study tackled in the remainder of the paper. Section 5 presents a detailed performance evaluation on a \CITHREE\ processor as our primary evaluation platform, addressing a number of microarchitectural issues including cache misses, branch mispredictions, and SIMD instruction counts. Section 6 examines scalability and performance gains through three generations of Intel architecture, culminating with a performance assessment on our two-week-old \SB\ test machine. Section 7 looks specifically at issues in applying the new 256-bit AVX technology to parallel bit stream technology and notes that the major performance benefit seen so far results from the change to the non-destructive three-operand instruction format. Section 8 concludes with a discussion of ongoing work and further research directions.
%Traditional measures of performance fail to capture the impact of energy consumption \cite{bellosa2001}.
%In a study done in 2007, it was estimated that in 2005, the annual operating cost\footnote{This figure only included the cost of server power consumption and cooling;
%it did not account for the cost of network traffic, data storage, service and maintenance or system replacement.} of corporate servers
%and data centers alone was over \$7.2 billion---with the expectation that this cost would increase to \$12.7 billion by 2010 \cite{koomey2007}.
%But when it comes to power consumption, corporate costs are not the only concern: in the world of mobile devices, battery life is paramount.
%While the capabilities and users' expectations of mobile devices have rapidly increased, little improvement to battery technology itself is foreseen in the near future \cite{silven2007, walker2007}.
%One area in which both servers and mobile devices devote considerable