Timestamp: Aug 19, 2011, 4:57:57 PM
Author: ashriram
Message: New Intro New title

File: 1 edited
  • docs/HPCA2011/01-intro.tex

    r1302 r1326  
    11\section{Introduction}
    2 
    3 Extensible Markup Language (XML) is a core technology standard
    4 of the World Wide Web Consortium (W3C) that provides a common
    5 framework for encoding and communicating structured information. 
    6 In applications ranging from Office Open XML in
    7 Microsoft Office to NDFD XML of the NOAA National Weather
    8 Service, from KML in Google Earth to Castor XML in the Martian Rovers,
    9 from ebXML for e-commerce data interchange to RSS for news feeds
    10 from web sites everywhere, XML plays a ubiquitous role in providing
    11 a common framework for data interoperability world-wide and beyond.
    12 As XML 1.0 editor Tim Bray is quoted in the W3C celebration of XML at 10 years,
    13 "there is essentially no computer in the world, desk-top, hand-held,
    14 or back-room, that doesn't process XML sometimes."
Classical Dennard scaling~\cite{}, which ensured that voltage scaling would let us keep all of the transistors afforded by Moore's law active, has come to an end. This has already forced a rethink of how general-purpose processors are built: processor frequencies have remained stagnant over the last five years, and Intel multicores now boost the speed of individual cores only when other cores on the chip are shut off. Chip makers strive for energy-efficient computing by operating cores at more moderate frequencies and seeking performance from a larger number of cores. Unfortunately, given the levels of parallelism in applications that multicores can exploit~\cite{blake-isca-2010}, it is not clear how far we can continue scaling the number of cores per chip~\cite{esmaeilzadeh-isca-2011}, because exploiting parallelism across multiple cores tends to require heavyweight threads that are difficult to manage and synchronize.
     17
     18
The desire to improve the overall efficiency of computing is pushing designers to explore customized hardware~\cite{venkatesh-asplos-2010, hameed-isca-2010} that accelerates specific parts of an application while avoiding the overheads present in general-purpose processors. Designers seek to exploit the transistor bounty to provision many different accelerators, keeping only the accelerators an application needs active and switching off the others to save power. While promising, this approach faces two obstacles: given the rapid evolution of languages and software, it is hard to define a fixed set of function-specific hardware for commodity processors, and the toolchain needed to create such customized hardware is itself a hard research challenge. We believe that software, applications, and runtime models themselves can be refactored to significantly improve the overall computing efficiency of commodity processors.
     33
     34
In this paper, we demonstrate with an XML parser that changes to the underlying algorithm and compute model can significantly improve efficiency on commodity processors. We achieve this by carefully redesigning the algorithm around the Parallel Bitstream runtime framework (Parabix), which exploits the SIMD extensions (SSE/AVX on x86, Neon on ARM) of commodity processors. Parabix uses modern ISA instructions that perform tens of operations on multiple character streams in a single instruction, amortizing the overheads of a general-purpose processor. Parabix also minimizes branches, or eliminates them entirely, resulting in a more efficient pipeline, and it improves overall register and cache utilization, which minimizes the energy wasted on data transfers. The SSE/AVX code in Parabix further uses sophisticated pack and unpack instructions to move data elements between registers and memory in a way that keeps the application's cache access pattern regular, yielding significantly fewer misses and better utilization. Overall, as summarized in Figure~\ref{perf-energy}, our Parabix-based XML parser improves performance by ?$\times$ and energy efficiency by ?$\times$ compared to widely used software parsers, and approaches the ? cycles-per-input-byte performance of ASIC XML parsers~\cite{}.\footnote{The actual energy consumption of the XML ASIC chips is not published by the companies.}
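The sketch below is a minimal illustration of the bit-stream idea (our own example, not the actual Parabix source; the function name is hypothetical): a 64-byte block of input is reduced to a 64-bit mask marking every occurrence of the character `<', with each SSE2 instruction examining 16 characters and no branch depending on the input data.

\begin{verbatim}
#include <emmintrin.h>
#include <stdint.h>

/* Illustrative only: mark every '<' in a 64-byte block with one bit.
   Each SSE2 instruction examines 16 characters at once, and no branch
   outcome depends on the input data. */
uint64_t mark_open_angle(const uint8_t *buf /* 64 bytes */)
{
    const __m128i lt = _mm_set1_epi8('<');
    uint64_t bits = 0;
    for (int i = 0; i < 4; i++) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + 16 * i));
        __m128i hits  = _mm_cmpeq_epi8(chunk, lt);   /* 0xFF where byte == '<' */
        bits |= (uint64_t)(uint16_t)_mm_movemask_epi8(hits) << (16 * i);
    }
    return bits;   /* bit i is set iff buf[i] == '<' */
}
\end{verbatim}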
     58
     59
XML is a particularly interesting application: it is a World Wide Web Consortium standard that provides a common framework for encoding and communicating data. XML provides critical data storage for applications ranging from Office Open XML in Microsoft Office to NDFD XML of the NOAA National Weather Service, from KML in Google Earth to Castor XML in the Martian Rovers, and to XML data in Android phones. XML parsing efficiency matters in multiple application areas: in server workloads the key focus is overall transactions per second, while in network switches and cell phones the latency and energy cost of parsing are of paramount importance. Software-based XML parsers are particularly inefficient; they consist of giant \textit{switch-case} statements that waste processor resources by introducing input-data-dependent branches, and they have poor cache efficiency because they sift forward and backward through the input stream trying to match parsed tags. XML ASIC chips have been available for over six years, but they typically lag behind CPUs in process technology due to cost constraints. Our focus is how much the performance of an XML parser on commodity processors can be improved with Parabix technology.
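The following sketch is our own illustration of the byte-at-a-time, switch-case style described above (it is not code from Expat or Xerces): every input byte drives a branch whose outcome depends on the data itself, which is what defeats the branch predictor and keeps the pipeline under-utilized.

\begin{verbatim}
#include <stddef.h>

/* Illustrative only: a byte-at-a-time scanner that counts element tags.
   The switch on parser state plus the per-byte comparisons introduce
   input-data-dependent branches on every character. */
enum state { IN_TEXT, IN_TAG };

size_t count_tags(const char *p, const char *end)
{
    enum state s = IN_TEXT;
    size_t tags = 0;
    while (p < end) {
        switch (s) {                       /* dispatch on parser state       */
        case IN_TEXT:
            if (*p == '<') s = IN_TAG;     /* data-dependent branch per byte */
            break;
        case IN_TAG:
            if (*p == '>') { s = IN_TEXT; tags++; }
            break;
        }
        ++p;
    }
    return tags;
}
\end{verbatim}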
     79
Overall, we make the following contributions in this paper.
     81
1) We develop an XML parser that demonstrates the impact of redesigning the core of an application to make more efficient use of commodity processors. We compare the Parabix-XML parser against conventional parsers and demonstrate the improvement in overall performance and energy efficiency. We also parallelize the Parabix-XML parser so that its different stages can exploit the SIMD units across all cores, which further improves performance while keeping energy consumption on par with the sequential version.
     91
2) We are the first to compare and contrast the SSE/AVX extensions across multiple generations of Intel processors and show that there are performance challenges when using newer-generation SIMD extensions, possibly due to their memory interface. We also compare ARM's Neon against x86's SIMD extensions and comment on the latency of SIMD operations across these architectures.
     98
3) Finally, we introduce a runtime framework, \textit{Parabix}, that abstracts the SIMD specifics of the machine (e.g., register widths) and provides a language framework that enables applications to run efficiently on commodity processors. Parabix allows general-purpose multicores to be used efficiently by an entirely new class of applications: text processing and parsing.
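To give a flavor of what such register-width abstraction can look like, the following sketch is a speculative illustration (the type and function names are ours, not the actual Parabix API): application code is written against a generic bit block, and the build selects the widest SIMD registers the target supports. Note that the 256-bit integer bitwise intrinsics shown require AVX2; on AVX-only or SSE-only hardware the 128-bit path applies.

\begin{verbatim}
#include <emmintrin.h>

#if defined(__AVX2__)
#include <immintrin.h>
/* 256-bit registers: one bit block covers 256 input positions. */
typedef __m256i bitblock_t;
static inline bitblock_t bb_and(bitblock_t a, bitblock_t b)
    { return _mm256_and_si256(a, b); }
static inline bitblock_t bb_or(bitblock_t a, bitblock_t b)
    { return _mm256_or_si256(a, b); }
#else
/* Fall back to 128-bit SSE registers. */
typedef __m128i bitblock_t;
static inline bitblock_t bb_and(bitblock_t a, bitblock_t b)
    { return _mm_and_si128(a, b); }
static inline bitblock_t bb_or(bitblock_t a, bitblock_t b)
    { return _mm_or_si128(a, b); }
#endif
\end{verbatim}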
     105
     106
     107
     108
     109
     110
     111\begin{comment}
     112Figure~\ref{perf-energy} is an energy-performance scatter plot showing
     113the results obtained.
     114
    15115
    16116With all this XML processing, a substantial literature has arisen
    17 addressing XML processing performance in general and the
    18 performance of XML parsers in particular.   Nicola and John
    19 specifically identified XML parsing as a threat to database
    20  performance and outlined a number of potential directions for potential
    21 performance improvements \cite{NicolaJohn03}.  The nature of XML
    22 APIs was found to have a significant affect on performance with
    23 event-based SAX (Simple API for XML) parsers avoiding the tree
    24 construction costs of the more flexible DOM (Document Object
    25 Model) parsers \cite{Perkins05}.  The commercial importance
    26 of XML parsing spurred developments of hardware-based approaches
    27 including the development of a custom XML chip \cite{Leventhal2009}
    28 as well as FPGA-based implementations \cite{DaiNiZhu2010}.
    29 However promising these approaches may be for particular niche applications,
    30 it is likely that the bulk of the world's XML
    31 processing workload will be carried out on commodity processors
    32 using software-based solutions.
addressing XML processing performance in general and the performance of XML parsers in particular. Nicola and John specifically identified XML parsing as a threat to database performance and outlined a number of directions for potential performance improvements \cite{NicolaJohn03}. The nature of XML APIs was found to have a significant effect on performance, with event-based SAX (Simple API for XML) parsers avoiding the tree-construction costs of the more flexible DOM (Document Object Model) parsers \cite{Perkins05}. The commercial importance of XML parsing spurred the development of hardware-based approaches, including a custom XML chip \cite{Leventhal2009} as well as FPGA-based implementations \cite{DaiNiZhu2010}. However promising these approaches may be for particular niche applications, it is likely that the bulk of the world's XML processing workload will be carried out on commodity processors using software-based solutions.
    33132
    34133To accelerate XML parsing performance in software, most recent
     
    43142benefits over traditional sequential parsing techniques that follow the
    44143byte-at-a-time model.
    45 
    46 With this focus on performance however, relatively little attention
    47 has been paid on reducing energy consumption in XML processing.  For example, in addressing
    48 performance through multicore parallelism, one generally must
    49 pay an energy price for performance gains because of the
    50 increased processing required for synchronization.   
    51 This focus on reduction of energy consumption is a key topic in this
    52 paper. We study the energy and performance
    53 characteristics of several XML parsers across three generations
    54 of x86-64 processor technology.  The parsers we consider are
    55 the widely used byte-at-a-time parsers Expat and Xerces, as well the
    56 Parabix1 and Parabix2 parsers based on parallel bit stream technology. 
    57 A compelling result is that the performance benefits of parallel bit stream technology
    58 translate directly and proportionally to substantial energy savings.
    59 Figure \ref{perf-energy} is an energy-performance scatter plot
    60 showing the results obtained.
     144\end{comment}
     145
     146
    61147
    62148\begin{figure}
     
    69155
    70156The remainder of this paper is organized as follows.
    71 Section 2 presents background material on XML parsing
    72 and traditional parsing methods.  Section 3 reviews
    73 parallel bit stream technology as applied to
    74 XML parsing in the Parabix1 and Parabix2 parsers.
    75 Section 4 introduces our methodology and approach
    76 for the performance and energy study tackled in the
    77 remainder of the paper.  Section 5 presents a
    78 detailed performance evaluation on a \CITHREE\ processor
    79 as our primary evaluation platform, addressing a
    80 number of microarchitectural issues including cache
    81 misses, branch mispredictions, SIMD instruction counts
    82 and so forth.  Section 6 examines scalability and
    83 performance gains through three generations of Intel
    84 architecture culminating with a performance assessment
    85 on our two week-old \SB\ test machine.
    86 Section 7 looks specifically at issues in applying
    87 the new 256-bit AVX technology to parallel bit stream
    88 technology and notes that the major performance benefit
    89 seen so far results from the change to the non-destructive three-operand
    90 instruction format.  Section 8 concludes with
    91 a discussion of ongoing work and further research directions.
    92 
    93 
    94 %Traditional measures of performance fail to capture the impact of energy consumption \cite {bellosa2001}.
    95 %In a study done in 2007, it was estimated that in 2005, the annual operating cost\footnote{This figure only included the cost of server power consumption and cooling;
    96 %it did not account for the cost of network traffic, data storage, service and maintenance or system replacement.} of corporate servers
    97 %and data centers alone was over \$7.2 billion---with the expectation that this cost would increase to \$12.7 billion by 2010 \cite{koomey2007}.
    98 %But when it comes to power consumption, corporate costs are not the only concern: in the world of mobile devices, battery life is paramount.
    99 %While the capabilities and users' expectations of mobile devices has rapidly increased, little imp%rovement to battery technology itself is foreseen in the near future \cite{silven2007, walker2007}.
Section~\ref{background} presents background material on XML parsing and provides insight into the inefficiency of traditional parsers on mainstream processors. Section~\ref{parallel-bitstream} reviews parallel bit stream technology, a framework for exploiting the sophisticated data-parallel SIMD extensions of modern processors. Section 5 presents a detailed performance evaluation on a \CITHREE\ processor as our primary evaluation platform, addressing a number of microarchitectural issues including cache misses, branch mispredictions, and SIMD instruction counts. Section 6 examines scalability and performance gains across three generations of Intel architecture, culminating in a performance assessment on our two-week-old \SB\ test machine. We also look specifically at the issues in applying the new 256-bit AVX technology to parallel bit streams and note that the major performance benefit seen so far results from the change to the non-destructive three-operand instruction format.
     173
     174
     175
    100176
    101177%One area in which both servers and mobile devices devote considerable