Changeset 1330


Timestamp:
Aug 20, 2011, 5:12:13 PM (8 years ago)
Author:
cameron
Message:

Revise intro

File:
1 edited

  • docs/HPCA2012/01-intro.tex

    r1326 r1330  
    11\section{Introduction}
    2 Classical Dennard Scaling~\cite{} which ensured that voltage scaling
    3 would enable us to keep all of transistors afforded by Moore's law
    4 active, has currently stopped. This has already resulted in a rethink
     2We have now long since reached the limit to classical Dennard voltage scaling~\cite{},
     3that enabled us to keep all of the transistors afforded by Moore's law
     4active. This has already resulted in a rethink
    55of the way general-purpose processors are built: processor frequencies
    6 have remained stagnant over the last 5 years and processor cores in
    7 multIntel multicores provide capability to boost core speeds if other
    8 cores on the chip are shut-off. Chip makers strive to achieve energy
     6have remained stagnant over the last 5 years with the capability to
     7boost core speeds on Intel multicores only if other
     8cores on the chip are shut off. Chip makers strive to achieve energy
    99efficient computing by operating at more optimal core frequencies and
    10 aiming to increase performance with larger number of
    11 cores. Unfortunately, given the levels of
    12 parallelism~\cite{blake-isca-2010} in applications, that multicores
    13 can exploit it is not certain up to how many cores we can continue
     10aim to increase performance with larger number of
     11cores. Unfortunately, given the limited levels of
     12parallelism that can be found in applications~\cite{blake-isca-2010},
     13it is not certain how many cores can be productively used in
    1414scaling our chips~\cite{esmaeilzadeh-isca-2011}. This is because
    1515exploiting parallelism across multiple cores tends to require
    1616heavyweight threads that are difficult to manage and synchronize.
    17 
    1817
    1918The desire to improve the overall efficiency of computing is pushing
     
    2322processors. They seek to exploit the transistor bounty to provision
    2423many different accelerators and keep only the accelerators needed for
    25 an application active while switching-off others on the chip to save
     24an application active while switching off others on the chip to save
    2625power consumption. While promising, given the fast evolution of
    2726languages and software, it's hard to define a set of fixed-function
     
    3231computing efficiency of commodity processors.
    3332
    34 
    35 In this paper, we demonstrate with an XML parser that changes to the
    36 underlying algorithm and compute model can significantly improve the
    37 efficiency on commodity processors. We achieve this efficiency by
    38 carefully redesigning the algorithm to exploit Parallel Bitstream
    39 runtime framework (Parabix) that exploits the SIMD extensions (SSE/AVX
    40 on x86, Neon on ARM) on commodity processors. The Parabix framework
    41 exploits modern instructions in the processor ISA that can execute 10s
    42 of operations (on multiple chararacter streams) in a single
    43 instruction and amortizes the overhead of general-purpose
    44 processor. Parabix also minimizes or eliminate branches entirely
    45 resulting in a more efficient pipeline and and improves overall
    46 register/cache utilization which minimizes energy wasted on data
    47 transfers. Parabix SSE/AVX exploits also include sophisticated
    48 instructions that enable the algorithm to pack and unpack the data
    49 elements from the registers which makes the overall cache access
    50 behavior of the application regular resulting in significantly fewer
    51 misses and better utilization. Overall as summarized by
    52 Figure~\ref{perf-energy} our Parabix-based XML parser improves the
    53 performance by ?$\times$ and energy efficiency by ?$\times$ compared
    54 to widely-used software parsers and approaching the performance of
    55 ?$cycles/input-byte$ performance of ASIC XML
    56 parsers~\cite{}.\footnote{The actual energy consumption of the XML
    57   ASIC chips is not published by the companies.}
     33In this paper, we tackle the infamous ``thirteenth dwarf'' (parsers/finite
     34state machines) that is considered to be the hardest application
     35class to parallelize~\cite{Asanovic:EECS-2006-183} and show how Parabix,
     36a novel software architecture, tool chain and run-time environment
     37can indeed be used to dramatically improve parsing efficiency on
     38commodity processors.   
     39Based on the concept of transposing
     40byte-oriented character data into parallel bit streams for the
     41individual bits of each byte, the Parabix framework exploits the SIMD
     42extensions (SSE/AVX on x86, Neon on ARM) on commodity processors
     43to process hundreds of character positions in an input
     44stream simultaneously.  To achieve transposition, Parabix exploits
     45sophisticated SIMD instructions that enable data elements to be packed and
     46unpacked from registers in a regular manner which improves the overall cache access
     47behavior of the application resulting in significantly fewer
     48misses and better utilization.
     49Parabix also dramatically reduces branches
     50in parsing code resulting in a more efficient pipeline and substantially
     51improves register/cache utilization which minimizes energy wasted on data
     52transfers.   
     53
     54We study Parabix technology in application to the problem of XML parsing
     55and develop several implementations for different computing platforms.
    5856
    5957
     
    7876XML parser on commodity processors with Parabix technology.
    7977
     78In the end, as summarized by
     79Figure~\ref{perf-energy} our Parabix-based XML parser improves the
     80performance by ?$\times$ and energy efficiency by ?$\times$ compared
     81to widely-used software parsers and approaching the performance of
     82?$cycles/input-byte$ performance of ASIC XML
     83parsers~\cite{}.\footnote{The actual energy consumption of the XML
     84  ASIC chips is not published by the companies.}
     85
    8086Overall we make the following contributions in this paper.
    8187
    82 1) We develop an XML parser that demonstrates the impact of
    83 redesigning the core of an application to make more efficient use of
    84 commodity processors. We compare the Parabix-XML parser against
    85 conventional parsers and demonstrate the improvement in overall
    86 performance and energy efficiency. We also paralleillize the
    87 Parabix-XML parser to enable the different stages in the parser to
    88 exploit SIMD units across all the cores. This further improves
    89 performance while maintaining the energy consumption constant with the
    90 sequential version.
    91 
    92 2) We are the first to compare and contrast SSE/AVX extensions across
     881) We introduce the Parabix architecture, tool chain and run-time
     89environment and describe how it may be used to produce efficient
     90XML parser implementations on a variety of commodity processors.
     91While studied in the context of XML parsing, the Parabix framework
     92can be widely applied to many problems in text processing and
     93parsing.
     94
     952) We compare our Parabix XML parsers against conventional parsers
     96and assess the improvement in overall performance and energy efficiency
     97on each platform.   We are the first to compare and contrast SSE/AVX extensions across
    9398multiple generations of Intel processors and show that there are
    9499performance challenges when using newer generation SIMD extensions,
    95 possibly due to their memory interface. We compare ARM's Neon again
    96 x86's SIMD extensions and comment on the latency of SIMD operations
    97 across these architectures.
    98 
    99 3) Finally, we introduce a runtime framework, \textit{Parabix}, that
    100 abstracts the SIMD specifics of the machine (e.g., register widths)
    101 and provides a language framework to enable applications to run
    102 efficiently on commodity processors. Parabix enables the
    103 general-purpose multicores to be used efficiently by an entirely new
    104 class of applications, text processing and parsing.
    105 
    106 
    107 
    108 
     100possibly due to their memory interface. We compare the ARM Neon
     101extensions against the x86 SIMD extensions and comment on the latency of
     102SIMD operations across these architectures.
     103
     1043) Finally, building on the SIMD parallelism of Parabix technology,
     105we multithread the Parabix XML parser to enable the different
     106stages in the parser to exploit SIMD units across all the cores.
     107This further improves performance while maintaining the energy consumption
     108constant with the sequential version.
    109109
    110110
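
The revised introduction describes transposing byte-oriented character data into parallel bit streams, one stream per bit of each byte, so that many character positions can be processed at once. The sketch below is only an illustration of that idea and is not taken from the Parabix code base: the function names `transpose_to_bitstreams` and `match_char` are hypothetical, and arbitrary-width Python integers stand in for SIMD registers (an assumption made for readability, not a claim about how Parabix is implemented).

```python
def transpose_to_bitstreams(data: bytes) -> list[int]:
    """Return 8 integers; bit i of streams[k] equals bit k of data[i] (k = 0 is the LSB)."""
    streams = [0] * 8
    for pos, byte in enumerate(data):
        for k in range(8):
            if (byte >> k) & 1:
                streams[k] |= 1 << pos
    return streams


def match_char(streams: list[int], ch: int, length: int) -> int:
    """Return a bit stream with bit i set where input byte i equals ch.

    Uses only bitwise AND/NOT over whole streams, so there is no
    per-character branch on the input data.
    """
    mask = (1 << length) - 1          # limit results to valid positions
    result = mask
    for k in range(8):
        if (ch >> k) & 1:
            result &= streams[k]
        else:
            result &= ~streams[k] & mask
    return result


text = b"<a><b/></a>"
streams = transpose_to_bitstreams(text)
lt = match_char(streams, ord("<"), len(text))
positions = [i for i in range(len(text)) if (lt >> i) & 1]
print(positions)
```

In this toy version each bitwise operation covers every position of the input at once; on real hardware the same logic would be applied one SIMD register width at a time.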