Timestamp: Aug 30, 2011, 10:47:59 AM (8 years ago)
Author: ashriram
Message: Minor bug fixes up to 04
File: 1 edited
  • docs/HPCA2012/01-intro.tex

    r1373 r1393  
    11\section{Introduction}
    22We have now long since reached the limit of classical Dennard voltage
    3 scaling that enabled us to keep all of transistors afforded by
    4 Moore's law active. This has already resulted in a rethink of the way
     3scaling that enabled us to keep all of the transistors afforded by Moore's
     4law active. This has already resulted in a rethink of the way
    55general-purpose processors are built: processor frequencies have
    66remained stagnant over the last 5 years with the capability to boost
    7 core speeds on Intel multicores only if other cores on the chip are
    8 shut off. Chip makers strive to achieve energy efficient computing by
    9 operating at more optimal core frequencies and aim to increase
    10 performance with a larger number of cores. Unfortunately, given the
    11 limited levels of parallelism that can be found in
    12 applications~\cite{blake-isca-2010}, it is not certain how many cores
    13 can be productively used in scaling our
    14 chips~\cite{esmaeilzadeh-isca-2011}. This is because exploiting
    15 parallelism across multiple cores tends to require heavy weight threads
    16 that are difficult to manage and synchronize.
     7a core's frequency only if other cores on the chip are shut off. Chip makers
     8strive to achieve energy-efficient computing by operating at
     9near-optimal core frequencies and aim to increase performance with a larger
     10number of cores. Unfortunately, given the limited levels of
     11parallelism that can be found in applications~\cite{blake-isca-2010},
     12it is not certain how many cores can be productively used in scaling
     13our chips~\cite{esmaeilzadeh-isca-2011}. This is because exploiting
     14parallelism across multiple cores tends to require heavyweight
     15threads that are difficult to manage and synchronize.
    1716
    1817The desire to improve the overall efficiency of computing is pushing
     
    3332In this paper, we tackle the infamous ``thirteenth dwarf''
    3433(parsers/finite state machines) that is considered to be the hardest
    35 application class to parallelize~\cite{Asanovic:EECS-2006-183} and
    36 show how Parabix, a novel software architecture, tool chain and
    37 run-time environment can indeed be used to dramatically improve
    38 parsing efficiency on commodity processors.  Based on the concept of
    39 transposing byte-oriented character data into parallel bit streams for
    40 the individual bits of each byte, the Parabix framework exploits the
    41 SIMD extensions on commodity processors (SSE/AVX on x86, Neon on ARM) 
     34application class to parallelize~\cite{Asanovic:EECS-2006-183}. We
     35present Parabix, a novel execution framework and software run-time
     36environment that can be used to dramatically improve the efficiency of
     37text processing and parsing on commodity processors.  Parabix
     38transposes byte-oriented character data into parallel bit streams for
     39the individual bits of each character byte and then exploits the
     40SIMD extensions on commodity processors (SSE/AVX on x86, Neon on ARM)
    4241to process hundreds of character positions in an input stream
    4342simultaneously.  To achieve transposition, Parabix exploits
     
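To make the bit-stream idea above concrete, here is a minimal C++ sketch (ours, not code from the Parabix sources) showing how one of the eight parallel bit streams can be gathered from a 16-byte block with SSE2 intrinsics; the transposition used inside Parabix, described later in the paper, is substantially more efficient than repeating this helper once per bit.

    #include <emmintrin.h>   // SSE2 intrinsics
    #include <cstdint>

    // Sketch only: gather bit position `bit` (0 = most significant, assumes bit < 8)
    // of each of 16 consecutive input bytes into a 16-bit value.  Repeating this for
    // bit = 0..7 across a buffer produces eight parallel bit streams of the kind
    // Parabix-style processing operates on.
    static inline uint16_t gather_bit(const uint8_t block[16], unsigned bit) {
        __m128i bytes = _mm_loadu_si128(reinterpret_cast<const __m128i *>(block));
        // Shift left so the selected bit of every byte becomes that byte's high bit;
        // bits spilling over from neighbouring bytes land below bit 7 because the
        // shift count is at most 7, so they never affect the result.
        bytes = _mm_sll_epi64(bytes, _mm_cvtsi32_si128(static_cast<int>(bit)));
        // _mm_movemask_epi8 collects the high bit of all 16 bytes in one step.
        return static_cast<uint16_t>(_mm_movemask_epi8(bytes));
    }

With 128-bit registers, each register of such a bit stream covers 128 character positions, which is what lets a handful of SIMD instructions act on hundreds of positions at once.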
    5756applications ranging from Office Open XML in Microsoft Office to NDFD
    5857XML of the NOAA National Weather Service, from KML in Google Earth to
    59 Castor XML in the Martian Rovers, as well as ubiquitous XML data in Android phones.  XML
    60 parsing efficiency is important for multiple application areas; in
    61 server workloads the key focus in on overall transactions per second,
    62 while in applications in network switches and cell phones, latency
    63 and energy are of paramount importance.  Traditional
    64 software-based XML parsers have many inefficiencies including
    65 considerable branch misprediction penalties due to complex
    66 input-dependent branching structures as well as poor use of memory bandwidth and
    67 data caches due to byte-at-a-time processing and multiple buffering.
    68 XML ASIC chips have been around for over 6 years, but typically lag
    69 behind CPUs in technology due to cost constraints. Our focus is how
    70 much we can improve performance of the XML parser on commodity
    71 processors with Parabix technology.
     58Castor XML in the Martian Rovers.  XML parsing efficiency is important
     59for multiple application areas; in server workloads the key focus is
     60on overall transactions per second, while for applications in network
     61switches and cell phones, latency and energy are of paramount
     62importance.  Traditional software-based XML parsers have many
     63inefficiencies, including considerable branch misprediction penalties
     64due to complex input-dependent branching structures as well as poor
     65use of memory bandwidth and data caches due to byte-at-a-time
     66processing and multiple buffering.  XML ASIC chips have been around
     67for over 6 years, but typically lag behind CPUs in technology due to
     68cost constraints. Our focus is how much we can improve the performance of
     69the XML parser on commodity processors with Parabix technology.
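As a rough illustration of that branching problem (a sketch of ours, not code from any particular parser), consider locating the next '<' in a buffer: a byte-at-a-time scan takes a data-dependent branch on every character, while an SSE2 version compares 16 bytes per iteration and branches only on a combined match mask.

    #include <emmintrin.h>   // SSE2 intrinsics
    #include <cstddef>

    // Byte-at-a-time scan for the next '<': one data-dependent branch per byte,
    // a typical source of the misprediction penalties noted above.
    size_t scan_byte_at_a_time(const char *buf, size_t len) {
        for (size_t i = 0; i < len; ++i)
            if (buf[i] == '<') return i;
        return len;
    }

    // The same scan using SSE2: 16 bytes are compared per iteration and the
    // per-byte branch is replaced by one test of a 16-bit match mask.
    size_t scan_simd(const char *buf, size_t len) {
        const __m128i target = _mm_set1_epi8('<');
        size_t i = 0;
        for (; i + 16 <= len; i += 16) {
            __m128i chunk =
                _mm_loadu_si128(reinterpret_cast<const __m128i *>(buf + i));
            int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, target));
            if (mask)
                return i + static_cast<size_t>(__builtin_ctz(mask));  // GCC/Clang builtin
        }
        for (; i < len; ++i)   // scan any remaining tail bytes
            if (buf[i] == '<') return i;
        return len;
    }

The Parabix approach described in this paper goes further, operating on whole parallel bit streams rather than scanning for one delimiter at a time.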
    7270
    7371In the end, as summarized by
     
    9290parsing.
    9391
    94 2) We compare  Parabix XML parsers against conventional parsers
    95 and assess the improvement in overall performance and energy efficiency
    96 on each platform.   We are the first to compare and contrast SSE/AVX extensions across
    97 multiple generation of Intel processors and show that there are
    98 performance challenges when using newer generation SIMD extensions,
    99 possibly due to their memory interface. We compare the ARM Neon
    100 extensions against the x86 SIMD extensions and comment on the latency of
    101 SIMD operations across these architectures.
     922) We compare Parabix XML parsers against conventional parsers and
     93assess the improvement in overall performance and energy efficiency on
     94each platform.  We are the first to compare and contrast SSE/AVX
     95extensions across multiple generations of Intel processors and show
     96that there are performance challenges when using newer-generation SIMD
     97extensions. We compare the ARM Neon extensions against the x86 SIMD
     98extensions and comment on the latency of SIMD operations across these
     99architectures.
    102100
    1031013) Finally, building on the SIMD parallelism of Parabix technology,
     
    154152
    155153The remainder of this paper is organized as follows.
    156 Section~\ref{section:background} presents background material on XML parsing
    157 and provides insight into the inefficiency of traditional parsers on
    158 mainstream processors.  Section~\ref{section:parabix} describes the
    159 Parabix architecture, tool chain and run-time environment.
    160 Section~\ref{section:parser} describes the application of the
    161 Parabix framework to the construction of an XML parser
    162 enforcing all the well-formedness rules of the XML
    163 specification.  Section~\ref{section:methodology} then describes
    164 the overall methodology of our performance and energy study.
    165 Section~\ref{section:baseline} presents a detailed
    166 performance evaluation on a \CITHREE\ processor as
    167 our primary evaluation platform, addressing a number of
    168 microarchitectural issues including cache misses, branch
    169 mispredictions, and SIMD instruction counts.  Section~\ref{section:scalability} examines
    170 scalability and performance gains through three generations of Intel
    171 architecture.  Section~\ref{section:avx} examines the extension
    172 of the Parabix technology to take advantage of Intel's new
    173 256-bit AVX technology, while Section~\ref{section:neon} investigates
    174 the applications of this technology on mobile platforms using
    175 ARM processors with Neon SIMD extensions.
    176 Section~\ref{section:multithread} then looks at the multithreading of the
    177 Parabix XML parser using pipeline parallelism.
    178 Section~\ref{section:related} discusses related work, after which
    179 Section~\ref{section:conclusion} concludes the paper.
     154Section~\ref{section:background} presents background material on XML
     155parsing and provides insight into the inefficiency of traditional
     156parsers on mainstream processors.  Section~\ref{section:parabix}
     157describes the Parabix architecture, tool chain and run-time
     158environment.  Section~\ref{section:parser} describes the application
     159of the Parabix framework to the construction of an XML parser
     160enforcing all the well-formedness rules of the XML specification.
     161Section~\ref{section:baseline} presents a detailed performance
     162analysis of Parabix on a \CITHREE\ system using hardware performance
     163counters and compares it against conventional parsers.
     164Section~\ref{section:scalability} compares the performance and energy
     165efficiency of 128-bit SIMD extensions across three generations of
     166Intel processors and includes a comparison with the ARM Cortex-A8
     167processor.  Section~\ref{section:avx} examines Intel's new 256-bit
     168AVX technology and comments on the benefits and challenges compared to
     169the 128-bit SSE instructions.  Finally,
     170Section~\ref{section:multithread} looks at the multithreading of the
     171Parabix XML parser, which seeks to exploit the SIMD units scattered
     172across multiple cores.
    180173
    181174