Ignore:
Timestamp:
Nov 30, 2011, 11:30:44 AM (8 years ago)
Author:
ashriram
Message:

First pass final version [ashriram]

File:
1 edited

Legend:

Unmodified
Added
Removed
  • docs/HPCA2012/final_ieee/01-intro.tex

    r1733 r1743  
    11\section{Introduction}
     2Modern applications ranging from web search to analytics are mainly
     3data centric operating large swathes of data. Information expansion
     4and diversification of data has resulted in multiple textual storage
     5formats. XML is a widely-used text-based data storage format. XML is a
     6standard of the web consortium that provides a common framework for
     7encoding and communicating data.  It is used in applications ranging
     8from Office Open XML in Microsoft Office to NDFD XML of the NOAA
     9National Weather Service, from KML in Google Earth to Castor XML in
     10the Martian Rovers. To enable these diverse applications we need high
     11performance, scalable, and energy efficient text processing stored in
     12these XML documents.
    213
    3 As a result of information expansion and diversification of the data format,
    4 the demands of high performance and energy efficient text processing are rapidly increasing.
    5 However, classical Dennard voltage scaling has reached its limits
    6 which gives the traditional byte-at-a-time processing methods little space
    7 for further improvement. An alternative is to increase energy efficiency
    8 by operating at more optimal core frequencies and achieve better performance
    9 with a larger number of cores. Unfortunately, given the limited levels of parallelism
    10 that can be found in applications~\cite{blake-isca-2010}, especailly text processing,
    11 in which, many applications, for example, XML parsing, are sequential by nature,
    12 it is not certain how many cores can be productively used in scaling our
    13 chips~\cite{esmaeilzadeh-isca-2011}. In a widely cited Berkeley study~\cite{Asanovic:EECS-2006-183},
    14 the infamous ``thirteenth dwarf'' (parsers/finite state machines) is considered to be the hardest
    15 application class to parallelize.
     14%; in server
     15%workloads the key focus in on overall transactions per second, while
     16%in applications for network switches and cell phones, latency and
     17%energy are of paramount importance.
    1618
    17 A new technology, Parabix, was introduced to exploit the SIMD extensions on commodity processors
    18 to process hundreds of character positions in an input stream simultaneously~\cite{Cameron2008}.
    19 Parabix first transposes byte-oriented character data into parallel bit streams
    20 using sophisticated SIMD instructions that enable data elements to be packed into registers.
    21 With the bit streams, where each bit represents one character from the input data, the text can then
    22 be processed in parallel within the SIMD registers.
    23 This improves the overall cache behaviour of the application resulting in significantly
    24 fewer misses and better utilization.  Parabix also dramatically
    25 reduces branches in the parsing routines resulting in a more efficient
    26 pipeline and substantially improves register utilization which
    27 minimizes energy wasted on data transfers.
    2819
    29 We apply Parabix technology to the problem of XML parsing.  XML is a
    30 standard of the web consortium that provides a common framework for
    31 encoding and communicating data.  XML provides critical data storage
    32 for applications ranging from Office Open XML in Microsoft Office to
    33 NDFD XML of the NOAA National Weather Service, from KML in Google
    34 Earth to Castor XML in the Martian Rovers.  XML parsing efficiency is
    35 important for multiple application areas; in server workloads the key
    36 focus in on overall transactions per second, while in applications for
    37 network switches and cell phones, latency and energy are of paramount
    38 importance.  Conventional software-based XML parsers have many
    39 inefficiencies including considerable branch misprediction penalties
    40 due to complex input-dependent branching structures as well as poor
    41 use of caches and memory bandwidth due to byte-at-a-time
    42 processing.  XML ASIC chips have been around since early 2003, but typically lag behind CPUs in technology due to
    43 cost constraints~\cite{xmlchip}. They also focus mainly on speeding up the parser
    44 computation itself and are limited by the poor memory behaviour.
    45 Our focus is how much we can improve performance of
    46 the XML parser on commodity processors with Parabix technology.
     20Unfortunately, given the limited levels of parallelism that can be
     21found in text processing, for example, XML parsing, is inherently
     22sequential, it is not clear how this important class of application
     23can benefit from the growth in multicore processors. As a widely cited
     24Berkeley study~\cite{Asanovic:EECS-2006-183} reports, the ``thirteenth
     25dwarf'' (parsers/finite state machines) which process text is
     26considered to be the hardest application class to parallelize and
     27process efficienctly.  Conventional software-based text parsers have
     28many inefficiencies including considerable branch misprediction
     29penalties due to complex input-dependent branching structures as well
     30as poor use of caches and memory bandwidth due to byte-at-a-time
     31processing. ASIC chips that process XML textual data have been around
     32since early 2003, but typically lag behind CPUs in technology due to
     33cost constraints~\cite{xmlchip}. They also focus mainly on speeding up
     34the parser computation itself and are limited by the poor memory
     35behaviour.
    4736
    48 The first generation of Parabix XML parser~\cite{CameronHerdyLin2008},
    49 which applies a sequential bit scan method, has already shown a
    50 substantial improvement on performance. The latest version or the
    51 second generation of Parabix XML parser~\cite{Cameron2010} introduced
    52 a new idea, parallel bit scan, which provides us a more efficient
    53 scanning and better utilization of the resources.
     37
     38
     39%
     40% Introduce Parabix.
     41%
     42
     43
     44
     45
     46%However, classical Dennard voltage scaling has reached its limits
     47%which gives the traditional byte-at-a-time processing methods little
     48%space for further improvement. An alternative is to increase energy
     49%efficiency by operating at more optimal core frequencies and achieve
     50%better performance with a larger number of cores.
     51%~\cite{blake-isca-2010}, in scaling our
     52%chips~\cite{esmaeilzadeh-isca-2011}
     53
     54Recently, we developed a novel representation of
     55data~\cite{Cameron2008,CameronLin2009}, Parabix (Parallel bitstreams),
     56to aid parsers and text processing tools. Parabix transposes
     57byte-oriented character data into parallel bit streams, where each bit
     58represents one character from the input data. We explored the use of
     59Parabix representation in UTF-8 to UTF-16 conversion and in specific
     60internal passes of an XML parser~\cite{cameron-EuroPar2011}.
     61
     62
     63
     64
     65
     66
     67%The first generation of Parabix XML parser~\cite{CameronHerdyLin2008},
     68%which applies a sequential bit scan method, has already shown a
     69%substantial improvement on performance. The latest version or the
     70%second generation of Parabix XML parser~\cite{Cameron2010} introduced
     71%a new idea, parallel bit scan, which provides us a more efficient
     72%scanning and better utilization of the resources.
    5473
    5574
     
    6382
    6483
    65 In this paper, We present Parabix tool chain, a novel execution framework
    66 and software runtime environment that can be used to dramatically improve
    67 the efficiency of text processing and parsing on commodity processors.
     84
     85
     86
     87
     88
     89
     90
     91
     92In this paper, we generalize parallel bitstreams and develop the
     93Parabix programming framework to help programmers build text
     94processing appliations. The programmers specify the perations on
     95unbounded character lists using bitstreams in a python environment,
     96while our code generation and runtime translate them into low-level
     97C++ routines.  The Parabix routines exploit the SIMD extensions on
     98commodity processors (SSE/AVX on x86, Neon on ARM) to process hundreds
     99of character positions in an input stream simultaneously dramatically
     100improving the execution efficiency. We describe the overall Parabix
     101tool chain, a novel execution framework and software build environment
     102that enables text processing applications to effectively exploit
     103commodity multicores.
     104
     105
     106We apply Parabix technology to the problem of XML parsing.
    68107Figure~\ref{perf-energy} showcases the overall efficiency of our
    69 framework. The Parabix-XML parser improves the
    70 performance %by ?$\times$
    71 and energy efficiency %by ?$\times$
    72 several-fold compared
    73 to widely-used software parsers, approaching the
    74 %?$cycles/input-byte$
    75 performance of ASIC XML
    76 parsers~\cite{xmlchip,DaiNiZhu2010}.
     108framework. The Parabix-XML parser improves the performance %by
     109?$\times$ and energy efficiency
     110%by ?$\times$ several-fold compared to widely-used software parsers,
     111approaching the
     112%?$cycles/input-byte$ performance of ASIC XML
     113parsers~\cite{xmlchip,DaiNiZhu2010}. The Parabix-XML parser exploits
     114the bitstream technology to dramatically reduce branches in the
     115parsing routines resulting in a more efficient pipeline. It also
     116substantially improves register utilization which minimizes energy
     117wasted on cache misses and data transfers. We make the following contributions:
     118
     119
    77120\footnote{The actual energy consumption of the XML
    78121  ASIC chips is not published by the companies.}
    79122%
    80 Overall we make the following contributions:
    81123
    82 1) We outline the Parabix architecture, tool chain and runtime
    83 environment and describe how it may be used to produce efficient
    84 XML parser implementations on a variety of commodity processors.
    85 While studied in the context of XML parsing, the Parabix framework
    86 can be widely applied to many problems in text processing and
    87 parsing.  We have released Parabix completely open source
    88 and are interested in exploring the applications that can take
    89 advantage of our tool chain (\textit{http://parabix.costar.sfu.ca/}).
     1241) We outline the Parabix architecture, code-generation tool chain and
     125runtime environment and describe how it may be used to produce
     126efficient XML parser implementations on a variety of commodity
     127processors.  While studied in the context of XML parsing, the Parabix
     128framework can be widely applied to many problems in text processing
     129and parsing.  We have released Parabix completely open source and are
     130interested in exploring the applications that can take advantage of
     131our tool chain (\textit{http://parabix.costar.sfu.ca/}).
    90132
    91133
    921342) We compare the Parabix XML parser against conventional parsers and
    93135assess the improvement in overall performance and energy efficiency on
    94 variety of hardware platforms.  We are the first to compare and
     136variety of hardware platforms.  We use Parabix to study and
    95137contrast SSE/AVX extensions across multiple generation of Intel
    96138processors and show that there are performance challenges when using
     
    99141operations.
    100142
    101 3) Finally, building on the SIMD parallelism of Parabix technology,
    102 we multithread the Parabix XML parser to enable the different
     1433) Finally, we multithread the Parabix XML parser to enable the different
    103144stages in the parser to exploit SIMD units across all the cores.
    104145This further improves performance while maintaining the energy consumption
    105 constant with the sequential version.
     146comparable with the sequential version.
    106147
    107148
     
    112153architecture, tool chain and runtime environment.
    113154Section~\ref{section:parser} describes the design of an XML parser
    114 based on the Parabix framework. Section~\ref{section:methodology} describes the evaluation framework and Section~\ref{section:baseline}
     155based on the Parabix framework. Section~\ref{section:baseline}
    115156presents a detailed performance analysis of Parabix on a
    116157\CITHREE\ system using hardware performance counters.
     
    121162AVX technology and comments on the benefits and challenges compared to
    122163the 128-bit SSE instructions.  Finally,
    123 Section~\ref{section:multithread} looks at the multithreading of the
    124 Parabix XML parser which seeks to exploit the SIMD units scattered
    125 across multiple cores.
     164Section~\ref{section:multithread} looks at multithreading to exploit
     165the SIMD units scattered across multiple cores.
    126166
    127167
Note: See TracChangeset for help on using the changeset viewer.