Changeset 1348 for docs/HPCA2012


Timestamp:
Aug 23, 2011, 1:02:30 AM
Author:
ashriram
Message:

new abstract for new intro

Location:
docs/HPCA2012
Files:
4 edited

  • docs/HPCA2012/00-abstract.tex

    r1302 r1348  
    1 XML is a set of rules for the encoding of documents in machine-readable form.
    2 The simplicity and generality of the rules make it widely used in web services and database
    3 systems.  Traditional XML parsers are built around a
    4 byte-at-a-time processing model where each character token
    5 of an XML document is examined in sequence.  Unfortunately, the byte-at-a-time
    6 sequential model is a performance barrier in more demanding applications,
    7 is energy-inefficient, and makes poor use of the wide SIMD registers
    8 and other parallelism features of modern processors.
     1Text files are employed widely in modern applications. For example,
     2XML files provide data storage in human-readable format and are widely
     3used in web services, database systems, and mobile phone SDKs.
     4Traditional text processing tools are built around a byte-at-a-time
     5processing model where each character token of a document is
     6examined in sequence. The byte-at-a-time model is highly challenging
     7for commodity processors. It includes many unpredictable
     8input-dependent branches, which cause pipeline squashes and stalls.
     9Furthermore, typical text processing tools perform few operations per
     10processed character and experience high cache miss rates when parsing
     11a file. Overall, parsing text in important domains like XML processing
     12requires high performance, and hardware designers have adopted
     13customized hardware and ASIC solutions.
    914
    10 This paper assesses the energy and performance of a new approach
    11 to XML parsing, based on parallel bit stream technology, and as implemented on successive
    12 software generations of the Parabix XML parser.
    13 In Parabix, we first convert character streams into sets of parallel
    14 bit streams. We then exploit the SIMD operations prevalent on commodity-level hardware for performance.
    15 The first generation Parabix1 parser exploits the processor built-in $bitscan$ instructions
    16 over these streams to make multibyte moves but follows an otherwise sequential
    17 approach.  The second generation Parabix2 technology adds further
    18 parallelism by replacing much of the sequential
    19 bit scanning with a parallel scanning approach based on bit stream
    20 addition.  We evaluate Parabix1 and Parabix2
    21 against two widely used XML parsers, James Clark's Expat and Apache's Xerces, and
    22 across three generations of x86 machines, including the new Intel
    23 \SB{}.  We show that Parabix2's speedup is 2$\times$--7$\times$
    24 over Expat and Xerces.  In stark contrast to the energy expenditures necessary
    25 to realize performance gains through multicore parallelism, we also show
    26 that our Parabix parsers deliver energy savings in direct proportion
    27 to the gains in performance.  In addition, we assess the scalability advantages
    28 of SIMD processor improvements across Intel processor generations,
    29 culminating with an evaluation of the 256-bit AVX technology in
    30 \SB{} versus the now legacy 128-bit SSE technology.
     15% In this paper on commodity.
     16% We expose through a toolchain.
     17% We demonstrate what can be achieved with branches etc.
     18% We study various tradeoffs.
     19% Finally we show the benefits can be stacked
    3120
     21In this paper we enable text processing applications to make
     22effective use of commodity processors. We introduce Parabix (Parallel
     23Bitstream) technology, a software runtime and execution model that
     24enables applications to exploit modern SIMD instruction extensions for
     25high-performance text processing. Parabix enables the application
     26developer to write constructs assuming unlimited SIMD data
     27parallelism. Our runtime translator generates code based on machine
     28specifics (e.g., SIMD register widths) to realize the programmer's
     29specifications.  The key insight behind efficient text processing in
     30Parabix is its data organization. It transposes a sequence of 8-bit
     31characters into a set of 8 parallel bit streams, which then enables us
     32to operate on multiple characters with single bit-parallel SIMD
     33operations. We demonstrate the features and efficiency of Parabix with
     34an XML parsing application. We evaluate the Parabix-based XML parser
     35against two widely used XML parsers, Expat and Apache's Xerces, across
     36three generations of x86 processors, including the new Intel
     37\SB{}.  We show that Parabix's speedup is 2$\times$--7$\times$ over
     38Expat and Xerces. We observe that Parabix makes efficient use of
     39intra-core parallel hardware on commodity processors and delivers
     40significant gains in energy efficiency. Using Parabix, we assess the
     41scalability advantages of SIMD improvements across Intel processor
     42generations, culminating with a look at the latest 256-bit AVX
     43technology in \SB{} versus the now-legacy 128-bit SSE technology. As
     44part of this study we also preview the Neon extensions on ARM
     45processors. Finally, we partition the XML parser into pipeline stages
     46and demonstrate that thread-level parallelism exploits the SIMD units
     47scattered across the different cores, improving performance
     48(2$\times$ on 4 cores) at the same energy level as the single-thread
     49version.
     50
     51
     52
  • docs/HPCA2012/01-intro.tex

    r1342 r1348  
    11\section{Introduction}
    2 We have now long since reached the limit to classical Dennard voltage scaling~\cite{},
    3 that enabled us to keep all of transistors afforded by Moore's law
    4 active. This has already resulted in a rethink
    5 of the way general-purpose processors are built: processor frequencies
    6 have remained stagnant over the last 5 years with the capability to
    7 boost core speeds on Intel multicores only if other
    8 cores on the chip are shut off. Chip makers strive to achieve energy
    9 efficient computing by operating at more optimal core frequencies and
    10 aim to increase performance with larger number of
    11 cores. Unfortunately, given the limited levels of
    12 parallelism that can be found in applications~\cite{blake-isca-2010},
    13 it is not certain how many cores can be productively used in
    14 scaling our chips~\cite{esmaeilzadeh-isca-2011}. This is because
    15 exploiting parallelism across multiple cores tends to require
    16 heavweight threads that are difficult to manage and synchronize.
     2We have now long since reached the limit to classical Dennard voltage
     3scaling that enabled us to keep all of the transistors afforded by
     4Moore's law active. This has already resulted in a rethink of the way
     5general-purpose processors are built: processor frequencies have
     6remained stagnant over the last 5 years with the capability to boost
     7core speeds on Intel multicores only if other cores on the chip are
     8shut off. Chip makers strive to achieve energy-efficient computing by
     9operating at more optimal core frequencies and aim to increase
     10performance with a larger number of cores. Unfortunately, given the
     11limited levels of parallelism that can be found in
     12applications~\cite{blake-isca-2010}, it is not certain how many cores
     13can be productively used in scaling our
     14chips~\cite{esmaeilzadeh-isca-2011}. This is because exploiting
     15parallelism across multiple cores tends to require heavyweight threads
     16that are difficult to manage and synchronize.
    1717
    1818The desire to improve the overall efficiency of computing is pushing
     
    3131computing efficiency of commodity processors.
    3232
    33 In this paper, we tackle the infamous ``thirteenth dwarf'' (parsers/finite
    34 state machines) that is considered to be the hardest application
    35 class to parallelize~\cite{Asanovic:EECS-2006-183} and show how Parabix,
    36 a novel software architecture, tool chain and run-time environment
    37 can indeed be used to dramatically improve parsing efficiency on
    38 commodity processors.   
    39 Based on the concept of transposing
    40 byte-oriented character data into parallel bit streams for the
    41 individual bits of each byte, the Parabix framework exploits the SIMD
    42 extensions (SSE/AVX on x86, Neon on ARM) on commodity processors
    43 to process hundreds of character positions in an input
    44 stream simultaneously.  To achieve transposition, Parabix exploits
    45 sophisticated SIMD instructions that enable data elements to be packed and
    46 unpacked from registers in a regular manner which improve the overall cache access
    47 behavior of the application resulting in significantly fewer
    48 misses and better utilization.
    49 Parabix also dramatically reduces branches
    50 in parsing code resulting in a more efficient pipeline and substantially
    51 improves register/cache utilization which minimizes energy wasted on data
    52 transfers.   
    53 
    54 We study Parabix technology in application to the problem of XML parsing
    55 and develop several implementations for different computing platforms.
    56 XML is a particularly interesting application; it is a standard of the
    57 web consortium that provides a common framework for encoding and
     33In this paper, we tackle the infamous ``thirteenth dwarf''
     34(parsers/finite state machines), which is considered to be the hardest
     35application class to parallelize~\cite{Asanovic:EECS-2006-183}, and
     36show how Parabix, a novel software architecture, tool chain and
     37run-time environment, can indeed be used to dramatically improve
     38parsing efficiency on commodity processors.  Based on the concept of
     39transposing byte-oriented character data into parallel bit streams,
     40one for each bit position of a byte, the Parabix framework exploits the
     41SIMD extensions (SSE/AVX on x86, Neon on ARM) of commodity processors
     42to process hundreds of character positions in an input stream
     43simultaneously.  To achieve transposition, Parabix exploits
     44sophisticated SIMD instructions that enable data elements to be packed
     45into and unpacked from registers in a regular manner, which improves
     46the overall cache access behavior of the application, resulting in
     47significantly fewer misses and better utilization.  Parabix also
     48dramatically reduces branches in parsing code, resulting in a more
     49efficient pipeline, and substantially improves register/cache
     50utilization, which minimizes energy wasted on data transfers.
     51
     52We apply Parabix technology to the problem of XML parsing and develop
     53several implementations for different computing platforms.  XML is a
     54particularly interesting application; it is a standard of the web
     55consortium that provides a common framework for encoding and
    5856communicating data.  XML provides critical data storage for
    5957applications ranging from Office Open XML in Microsoft Office to NDFD
     
    6159Castor XML in the Martian Rovers, and XML data in Android phones.  XML
    6260parsing efficiency is important for multiple application areas; in
    63 server workloads the key focus in on overall transactions per second
    64 while in applications in the network switches and cell phones latency
    65 and the energy cost of parsing is of paramount
    66 importance.   Traditional software-based XML parsers have many
    67 inefficiencies due to complex input-dependent branching structures
    68 leading to considerable branch misprediction penalties as well
    69 as poor use of memory bandwidth and data caches due to byte-at-a-time
    70 processing and multiple buffering.  XML ASIC chips have been around for over 6
    71 years, but typically lag behind CPUs in technology due to cost
    72 constraints. Our focus is how much can we improve performance of the
    73 XML parser on commodity processors with Parabix technology.
     61server workloads the key focus is on overall transactions per second,
     62while in network switches and cell phones, latency
     63and energy are of paramount importance.  Traditional
     64software-based XML parsers have many inefficiencies due to complex
     65input-dependent branching structures leading to considerable branch
     66misprediction penalties as well as poor use of memory bandwidth and
     67data caches due to byte-at-a-time processing and multiple buffering.
     68XML ASIC chips have been around for over 6 years, but typically lag
     69behind CPUs in technology due to cost constraints. Our focus is on how
     70much we can improve the performance of the XML parser on commodity
     71processors with Parabix technology.
    7472
    7573In the end, as summarized by
  • docs/HPCA2012/09-pipeline.tex

    r1347 r1348  
    66Moreover, using multiple cores, we can further improve the performance of Parabix while keeping the energy consumption at the same level.
    77
    8 A typical approach to parallelizing software, data parallelism, requires nearly independent data,
    9 However, the nature of XML files makes them hard to partition nicely for data parallelism.
    10 Several approaches have been used to address this problem.
    11 A preparsing phase has been proposed to help partition the XML document \cite{dataparallel}.
    12 The goal of this preparsing is to determine the tree structure of the XML document
    13 so that it can be used to guide the full parsing in the next phase.
    14 Another data parallel algorithm is called ParDOM \cite{Shah:2009}.
    15 It first builds partial DOM node tree structures for each data segments and then link them
    16 using preorder numbers that has been assigned to each start element to determine the ordering among siblings
    17 and a stack to manage the parent-child relationship between elements.
     8A typical approach to parallelizing software, data parallelism,
     9requires nearly independent data. However, the nature of XML files
     10makes them hard to partition nicely for data parallelism.  Several
     11approaches have been used to address this problem.  A preparsing phase
     12has been proposed to help partition the XML document
     13\cite{dataparallel}.  The goal of this preparsing is to determine the
     14tree structure of the XML document so that it can be used to guide the
     15full parsing in the next phase.  Another data-parallel algorithm is
     16called ParDOM \cite{Shah:2009}.  It first builds partial DOM node tree
     17structures for each data segment and then links them using preorder
     18numbers that have been assigned to each start element to determine the
     19ordering among siblings and a stack to manage the parent-child
     20relationship between elements.
    1821
    19 Data parallelism approaches introduce a lot of overheads to solve the data dependencies between segments.
    20 Therefore, instead of partitioning the data into segments and assigning different data segments to different cores,
    21 we propose a pipeline parallelism strategy that partitions the process into several stages and let each core work with one single stage.
     22Data-parallelism approaches introduce considerable overhead to resolve
     23the data dependencies between segments.  Therefore, instead of
     24partitioning the data into segments and assigning different data
     25segments to different cores, we propose a pipeline parallelism
     26strategy that partitions the process into several stages and lets each
     27core work on a single stage.
    2228
    23 The interface between stages is implemented using a circular array,
    24 where each entry consists of all ten data structures for one segment as listed in Table \ref{pass_structure}.
    25 Each thread keeps an index of the array ($I_N$),
    26 which is compared with the index ($I_{N-1}$) kept by its previous thread before processing the segment.
    27 If $I_N$ is smaller than $I_{N-1}$, thread N can start processing segment $I_N$,
    28 otherwise the thread keeps reading $I_{N-1}$ until $I_{N-1}$ is larger than $I_N$.
    29 The time consumed by continuously loading the value of $I_{N-1}$ and
    30 comparing it with $I_N$ will be later referred as stall time.
    31 When a thread finishes processing the segment, it increases the index by one.
     29The interface between stages is implemented using a circular array,
     30where each entry consists of all ten data structures for one segment,
     31as listed in Table \ref{pass_structure}.  Each thread keeps an index
     32into the array ($I_N$), which is compared with the index ($I_{N-1}$)
     33kept by the previous thread before processing a segment.  If $I_N$
     34is smaller than $I_{N-1}$, thread $N$ can start processing segment
     35$I_N$; otherwise, the thread keeps reading $I_{N-1}$ until $I_{N-1}$
     36is larger than $I_N$.  The time consumed by continuously loading the
     37value of $I_{N-1}$ and comparing it with $I_N$ will later be referred
     38to as stall time.  When a thread finishes processing a segment, it
     39increments its index by one.
    3240
    3341\begin{table*}[t]
  • docs/HPCA2012/main.tex

    r1343 r1348  
    127127%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    128128% ACM title header format
    129 \title{\vspace{-30pt} Boosting the Efficiency of Text Processing on Commodity Processors :\\ The Parabix Story.
     129\title{\vspace{-30pt} Boosting the Efficiency of Text Processing on Commodity Processors: The Parabix Story
    130130%
    131131% \thanks{%