Timestamp: Aug 23, 2011, 1:02:30 AM
Author: ashriram
Message: new abstract for new intro
File: 1 edited
  • docs/HPCA2012/01-intro.tex

r1342 → r1348

\section{Introduction}
We have now long since reached the limit of classical Dennard voltage
scaling that enabled us to keep all of the transistors afforded by
Moore's law active. This has already forced a rethink of the way
general-purpose processors are built: processor frequencies have
remained stagnant over the last five years, with Intel multicores able
to boost core speeds only if other cores on the chip are shut off.
Chip makers strive for energy-efficient computing by operating at more
efficient core frequencies and aim to increase performance with larger
numbers of cores. Unfortunately, given the limited levels of
parallelism that can be found in applications~\cite{blake-isca-2010},
it is not certain how many cores can be productively used in scaling
our chips~\cite{esmaeilzadeh-isca-2011}, because exploiting
parallelism across multiple cores tends to require heavyweight threads
that are difficult to manage and synchronize.

The desire to improve the overall efficiency of computing is pushing
[...]
computing efficiency of commodity processors.

In this paper, we tackle the infamous ``thirteenth dwarf''
(parsers/finite state machines), considered the hardest application
class to parallelize~\cite{Asanovic:EECS-2006-183}, and show how
Parabix, a novel software architecture, toolchain, and run-time
environment, can be used to dramatically improve parsing efficiency on
commodity processors.  Based on the concept of transposing
byte-oriented character data into parallel bit streams, one for each
of the individual bits of each byte, the Parabix framework exploits
the SIMD extensions (SSE/AVX on x86, NEON on ARM) of commodity
processors to process hundreds of character positions in an input
stream simultaneously.  To achieve transposition, Parabix employs
sophisticated SIMD instructions that pack and unpack data elements
from registers in a regular manner, which improves the overall cache
access behavior of the application, resulting in significantly fewer
misses and better utilization.  Parabix also dramatically reduces
branches in parsing code, resulting in a more efficient pipeline, and
substantially improves register/cache utilization, which minimizes
energy wasted on data transfers.

We apply Parabix technology to the problem of XML parsing and develop
several implementations for different computing platforms.  XML is a
particularly interesting application: it is a standard of the World
Wide Web Consortium that provides a common framework for encoding and
communicating data.  XML provides critical data storage for
applications ranging from Office Open XML in Microsoft Office to NDFD
[...]
Castor XML in the Martian Rovers, and XML data in Android phones.  XML
parsing efficiency is important for multiple application areas; in
server workloads the key focus is on overall transactions per second,
while in applications in network switches and cell phones, latency and
energy are of paramount importance.  Traditional software-based XML
parsers have many inefficiencies due to complex input-dependent
branching structures, leading to considerable branch misprediction
penalties as well as poor use of memory bandwidth and data caches due
to byte-at-a-time processing and multiple buffering.  XML ASIC chips
have been available for over six years, but typically lag behind CPUs
in technology due to cost constraints. Our focus is on how much we can
improve the performance of XML parsing on commodity processors with
Parabix technology.

In the end, as summarized by