Changeset 1743


Timestamp:
Nov 30, 2011, 11:30:44 AM
Author:
ashriram
Message:

First pass final version [ashriram]

Location:
docs/HPCA2012
Files:
16 edited

  • docs/HPCA2012/01-intro.tex

    r1691 r1743  
    11\section{Introduction}
    22
    3 As a result of information expansion and diversification of the data format,
    4 the demands of high performance and energy efficient text processing are rapidly increasing.
    5 However, classical Dennard voltage scaling has reached its limits
    6 which gives the traditional byte-at-a-time processing methods little space
    7 for further improvement. An alternative is to increase energy efficiency
    8 by operating at more optimal core frequencies and achieve better performance
    9 with a larger number of cores. Unfortunately, given the limited levels of parallelism
    10 that can be found in applications~\cite{blake-isca-2010}, especailly text processing,
    11 in which, many applications, for example, XML parsing, are sequential by nature,
    12 it is not certain how many cores can be productively used in scaling our
    13 chips~\cite{esmaeilzadeh-isca-2011}. In a widely cited Berkeley study~\cite{Asanovic:EECS-2006-183},
    14 the infamous ``thirteenth dwarf'' (parsers/finite state machines) is considered to be the hardest
    15 application class to parallelize.
     3As a result of information expansion and diversification of the data
     4format, the demand for high performance and energy efficient text
     5processing is rapidly rising. A widely-used text-based data storage
     6format is XML. XML is a standard of the web consortium that provides a
     7common framework for encoding and communicating data.  XML is used in
     8applications ranging from Office Open XML in Microsoft Office to NDFD
     9XML of the NOAA National Weather Service, from KML in Google Earth to
     10Castor XML in the Martian Rovers. In a widely cited Berkeley
     11study~\cite{Asanovic:EECS-2006-183}, the ``thirteenth dwarf''
     12(parsers/finite state machines), which processes text, is considered to be
     13the hardest application class to parallelize.
    1614
    17 A new technology, Parabix, was introduced to exploit the SIMD extensions on commodity processors
    18 to process hundreds of character positions in an input stream simultaneously~\cite{Cameron2008}.
    19 Parabix first transposes byte-oriented character data into parallel bit streams
    20 using sophisticated SIMD instructions that enable data elements to be packed into registers.
    21 With the bit streams, where each bit represents one character from the input data, the text can then
    22 be processed in parallel within the SIMD registers.
    23 This improves the overall cache behaviour of the application resulting in significantly
    24 fewer misses and better utilization.  Parabix also dramatically
    25 reduces branches in the parsing routines resulting in a more efficient
    26 pipeline and substantially improves register utilization which
    27 minimizes energy wasted on data transfers.
    2815
    29 We apply Parabix technology to the problem of XML parsing.  XML is a
    30 standard of the web consortium that provides a common framework for
    31 encoding and communicating data.  XML provides critical data storage
    32 for applications ranging from Office Open XML in Microsoft Office to
    33 NDFD XML of the NOAA National Weather Service, from KML in Google
    34 Earth to Castor XML in the Martian Rovers.  XML parsing efficiency is
    35 important for multiple application areas; in server workloads the key
    36 focus in on overall transactions per second, while in applications for
    37 network switches and cell phones, latency and energy are of paramount
    38 importance.  Conventional software-based XML parsers have many
    39 inefficiencies including considerable branch misprediction penalties
    40 due to complex input-dependent branching structures as well as poor
    41 use of caches and memory bandwidth due to byte-at-a-time
    42 processing.  XML ASIC chips have been around since early 2003, but typically lag behind CPUs in technology due to
    43 cost constraints~\cite{xmlchip}. They also focus mainly on speeding up the parser
    44 computation itself and are limited by the poor memory behaviour.
     16Given the limited levels of parallelism that can be found in text
     17processing (XML parsing, for example, is inherently sequential), it is
     18not certain how many cores can be productively used.  Conventional
     19software-based XML parsers have many inefficiencies including
     20considerable branch misprediction penalties due to complex
     21input-dependent branching structures as well as poor use of caches and
     22memory bandwidth due to byte-at-a-time processing. XML ASIC chips have
     23been around since early 2003, but typically lag behind CPUs in
     24technology due to cost constraints~\cite{xmlchip}. They also focus
     25mainly on speeding up the parser computation itself and are limited by
     26the poor memory behaviour.
     27
     28
     29; in server
     30workloads the key focus is on overall transactions per second, while
     31in applications for network switches and cell phones, latency and
     32energy are of paramount importance.
     33
     34
     35
     36
     37%
     38% Introduce Parabix.
     39%
     40
     41
     42
     43
     44%However, classical Dennard voltage scaling has reached its limits
     45%which gives the traditional byte-at-a-time processing methods little
     46%space for further improvement. An alternative is to increase energy
     47%efficiency by operating at more optimal core frequencies and achieve
     48%better performance with a larger number of cores.
     49%~\cite{blake-isca-2010}, in scaling our
     50%chips~\cite{esmaeilzadeh-isca-2011}
     51
     52A new technology, Parabix, was introduced to exploit the SIMD
     53extensions on commodity processors to process hundreds of character
     54positions in an input stream simultaneously~\cite{Cameron2008}.
     55Parabix first transposes byte-oriented character data into parallel
     56bit streams using sophisticated SIMD instructions that enable data
     57elements to be packed into registers.  With the bit streams, where
     58each bit represents one character from the input data, the text can
     59then be processed in parallel within the SIMD registers.  This
     60improves the overall cache behaviour of the application resulting in
     61significantly fewer misses and better utilization.  Parabix also
     62dramatically reduces branches in the parsing routines resulting in a
     63more efficient pipeline and substantially improves register
     64utilization which minimizes energy wasted on data transfers.
     65
     66We apply Parabix technology to the problem of XML parsing.
    4567Our focus is how much we can improve performance of
    4668the XML parser on commodity processors with Parabix technology.
    4769
    48 The first generation of Parabix XML parser~\cite{CameronHerdyLin2008},
    49 which applies a sequential bit scan method, has already shown a
    50 substantial improvement on performance. The latest version or the
    51 second generation of Parabix XML parser~\cite{Cameron2010} introduced
    52 a new idea, parallel bit scan, which provides us a more efficient
    53 scanning and better utilization of the resources.
     70%The first generation of Parabix XML parser~\cite{CameronHerdyLin2008},
     71%which applies a sequential bit scan method, has already shown a
     72%substantial improvement on performance. The latest version or the
     73%second generation of Parabix XML parser~\cite{Cameron2010} introduced
     74%a new idea, parallel bit scan, which provides us a more efficient
     75%scanning and better utilization of the resources.
    5476
    5577
  • docs/HPCA2012/final_ieee/00-abstract.tex

    r1733 r1743  
    4848% the single-thread version for the XML application.
    4949
     50Modern applications employ text files widely for providing data
     51storage in readable format for applications ranging from database
     52systems to mobile phones. Traditional text processing tools are built
     53around a byte-at-a-time sequential processing model, and introduce
     54significant branch and cache miss penalties. Recently, researchers have
     55explored a transposed representation of text, Parabix (Parallel Bit
     56Stream),  to improve the efficiency of text processing.
    5057
    51 Traditional text processing tools are built around a byte-at-a-time
    52 sequential processing model, which is hard to parallelize without special hardware.
    53 However, Parabix (Parallel Bit Stream) technology
    54 enables text processing applications to effectively use commodity processors.
    55 In this paper, we generalize Parabix into a software toolchain and execution
    56 framework that allows applications to exploit modern SIMD instructions for high
    57 performance text processing. This toolchain enables the application developer
    58 to write constructs assuming unlimited SIMD data parallelism and Parabix's
    59 bit stream translator generates code based on machine specifics (e.g.,
    60 SIMD register widths). We demonstrate the features and efficiency of Parabix with
    61 an XML parsing application. We evaluate the Parabix-based parser
    62 against two widely used XML parsers, Expat and Apache's
    63 Xerces. Parabix makes efficient use of intra-core SIMD hardware and
    64 demonstrates 2$\times$--7$\times$ speedup and 4$\times$ improvement in
    65 energy efficiency compared to the conventional parsers. We assess the
    66 scalability of SIMD implementations across three generations of x86
    67 processors including the new \SB{}. We compare the 256-bit AVX
    68 technology in Intel \SB{} versus the now legacy 128-bit SSE technology
    69 and analyze the benefits and challenges of using the AVX
    70 extensions.  Finally, we partition the XML program into pipeline stages
    71 and demonstrate that thread-level parallelism enables the application
    72 to exploits SIMD units scattered across the different cores and
    73 improves performance (2$\times$ on 4 cores) at same energy levels as
    74 the single-thread version for the XML application.
     58In this paper, we explore a general programming framework based on
     59Parabix and describe the software toolchain and execution framework
     60that allows applications to exploit modern SIMD instructions for high
     61performance text processing. The toolchain enables the application
     62developer to write constructs assuming unbounded character streams
     63and Parabix's code translator generates code based on machine
     64specifics (e.g., SIMD register widths). We demonstrate the features
     65and efficiency of Parabix with an XML parsing application. Parabix
     66exploits intra-core SIMD hardware and demonstrates
     672$\times$--7$\times$ speedup and 4$\times$ improvement in energy
     68efficiency compared to two widely used conventional software parsers,
     69Expat and Apache-Xerces. We study SIMD implementations across three
     70generations of x86 processors including the new \SB{}. We compare the
     71256-bit AVX technology in Intel \SB{} versus the now legacy 128-bit
     72SSE technology and analyze the benefits and challenges of 3-operand
     73instruction formats and wider SIMD hardware.  Finally, we partition
     74the XML program into pipeline stages and demonstrate that thread-level
     75parallelism enables the application to exploit SIMD units scattered
     76across the different cores and improves performance (2$\times$ on 4
     77cores) at the same energy levels as the single-thread version for the XML
     78application.
  • docs/HPCA2012/final_ieee/01-intro.tex

    r1733 r1743  
    11\section{Introduction}
     2Modern applications ranging from web search to analytics are mainly
     3data centric, operating on large swathes of data. Information expansion
     4and diversification of data has resulted in multiple textual storage
     5formats. XML is a widely-used text-based data storage format. XML is a
     6standard of the web consortium that provides a common framework for
     7encoding and communicating data.  It is used in applications ranging
     8from Office Open XML in Microsoft Office to NDFD XML of the NOAA
     9National Weather Service, from KML in Google Earth to Castor XML in
     10the Martian Rovers. To enable these diverse applications we need high
     11performance, scalable, and energy efficient processing of the text stored in
     12these XML documents.
    213
    3 As a result of information expansion and diversification of the data format,
    4 the demands of high performance and energy efficient text processing are rapidly increasing.
    5 However, classical Dennard voltage scaling has reached its limits
    6 which gives the traditional byte-at-a-time processing methods little space
    7 for further improvement. An alternative is to increase energy efficiency
    8 by operating at more optimal core frequencies and achieve better performance
    9 with a larger number of cores. Unfortunately, given the limited levels of parallelism
    10 that can be found in applications~\cite{blake-isca-2010}, especailly text processing,
    11 in which, many applications, for example, XML parsing, are sequential by nature,
    12 it is not certain how many cores can be productively used in scaling our
    13 chips~\cite{esmaeilzadeh-isca-2011}. In a widely cited Berkeley study~\cite{Asanovic:EECS-2006-183},
    14 the infamous ``thirteenth dwarf'' (parsers/finite state machines) is considered to be the hardest
    15 application class to parallelize.
     14%; in server
     15%workloads the key focus in on overall transactions per second, while
     16%in applications for network switches and cell phones, latency and
     17%energy are of paramount importance.
    1618
    17 A new technology, Parabix, was introduced to exploit the SIMD extensions on commodity processors
    18 to process hundreds of character positions in an input stream simultaneously~\cite{Cameron2008}.
    19 Parabix first transposes byte-oriented character data into parallel bit streams
    20 using sophisticated SIMD instructions that enable data elements to be packed into registers.
    21 With the bit streams, where each bit represents one character from the input data, the text can then
    22 be processed in parallel within the SIMD registers.
    23 This improves the overall cache behaviour of the application resulting in significantly
    24 fewer misses and better utilization.  Parabix also dramatically
    25 reduces branches in the parsing routines resulting in a more efficient
    26 pipeline and substantially improves register utilization which
    27 minimizes energy wasted on data transfers.
    2819
    29 We apply Parabix technology to the problem of XML parsing.  XML is a
    30 standard of the web consortium that provides a common framework for
    31 encoding and communicating data.  XML provides critical data storage
    32 for applications ranging from Office Open XML in Microsoft Office to
    33 NDFD XML of the NOAA National Weather Service, from KML in Google
    34 Earth to Castor XML in the Martian Rovers.  XML parsing efficiency is
    35 important for multiple application areas; in server workloads the key
    36 focus in on overall transactions per second, while in applications for
    37 network switches and cell phones, latency and energy are of paramount
    38 importance.  Conventional software-based XML parsers have many
    39 inefficiencies including considerable branch misprediction penalties
    40 due to complex input-dependent branching structures as well as poor
    41 use of caches and memory bandwidth due to byte-at-a-time
    42 processing.  XML ASIC chips have been around since early 2003, but typically lag behind CPUs in technology due to
    43 cost constraints~\cite{xmlchip}. They also focus mainly on speeding up the parser
    44 computation itself and are limited by the poor memory behaviour.
    45 Our focus is how much we can improve performance of
    46 the XML parser on commodity processors with Parabix technology.
     20Unfortunately, given the limited levels of parallelism that can be
     21found in text processing (XML parsing, for example, is inherently
     22sequential), it is not clear how this important class of application
     23can benefit from the growth in multicore processors. As a widely cited
     24Berkeley study~\cite{Asanovic:EECS-2006-183} reports, the ``thirteenth
     25dwarf'' (parsers/finite state machines), which processes text, is
     26considered to be the hardest application class to parallelize and
     27process efficiently.  Conventional software-based text parsers have
     28many inefficiencies including considerable branch misprediction
     29penalties due to complex input-dependent branching structures as well
     30as poor use of caches and memory bandwidth due to byte-at-a-time
     31processing. ASIC chips that process XML textual data have been around
     32since early 2003, but typically lag behind CPUs in technology due to
     33cost constraints~\cite{xmlchip}. They also focus mainly on speeding up
     34the parser computation itself and are limited by the poor memory
     35behaviour.
    4736
    48 The first generation of Parabix XML parser~\cite{CameronHerdyLin2008},
    49 which applies a sequential bit scan method, has already shown a
    50 substantial improvement on performance. The latest version or the
    51 second generation of Parabix XML parser~\cite{Cameron2010} introduced
    52 a new idea, parallel bit scan, which provides us a more efficient
    53 scanning and better utilization of the resources.
     37
     38
     39%
     40% Introduce Parabix.
     41%
     42
     43
     44
     45
     46%However, classical Dennard voltage scaling has reached its limits
     47%which gives the traditional byte-at-a-time processing methods little
     48%space for further improvement. An alternative is to increase energy
     49%efficiency by operating at more optimal core frequencies and achieve
     50%better performance with a larger number of cores.
     51%~\cite{blake-isca-2010}, in scaling our
     52%chips~\cite{esmaeilzadeh-isca-2011}
     53
     54Recently, we developed a novel representation of
     55data~\cite{Cameron2008,CameronLin2009}, Parabix (Parallel bitstreams),
     56to aid parsers and text processing tools. Parabix transposes
     57byte-oriented character data into parallel bit streams, where each bit
     58represents one character from the input data. We explored the use of
     59Parabix representation in UTF-8 to UTF-16 conversion and in specific
     60internal passes of an XML parser~\cite{cameron-EuroPar2011}.
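
The transposition described above can be illustrated with a minimal Python sketch. This is not the Parabix implementation: Python integers stand in for SIMD registers, bit $i$ of each stream is assumed to correspond to byte $i$ of the input (the numbering convention is an assumption of this sketch), and the names `transpose_to_bit_streams`, `match_byte`, and `positions` are hypothetical, chosen only for illustration.

```python
def transpose_to_bit_streams(data):
    """Return 8 bit streams for a bytes object.

    Stream k holds bit k (0 = least significant) of every input byte;
    bit i of each stream corresponds to byte i of the input.
    """
    streams = [0] * 8
    for i, byte in enumerate(data):
        for k in range(8):
            if (byte >> k) & 1:
                streams[k] |= 1 << i
    return streams


def match_byte(streams, length, target):
    """Mark every position whose byte equals `target`.

    Only whole-stream bitwise operations are used here, so one AND or
    NOT handles all `length` character positions at once.
    """
    all_ones = (1 << length) - 1
    result = all_ones
    for k in range(8):
        if (target >> k) & 1:
            result &= streams[k]
        else:
            result &= ~streams[k] & all_ones
    return result


def positions(stream):
    """List the indices of the set bits in a stream."""
    return [i for i in range(stream.bit_length()) if (stream >> i) & 1]


if __name__ == "__main__":
    text = b"<a><b/></a>"
    basis = transpose_to_bit_streams(text)
    print(positions(match_byte(basis, len(text), ord("<"))))  # [0, 3, 7]
```

In this sketch the per-byte loop appears only in the transposition step; once the eight streams exist, locating a character costs a fixed number of bitwise operations regardless of the input contents.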
     61
     62
     63
     64
     65
     66
     67%The first generation of Parabix XML parser~\cite{CameronHerdyLin2008},
     68%which applies a sequential bit scan method, has already shown a
     69%substantial improvement on performance. The latest version or the
     70%second generation of Parabix XML parser~\cite{Cameron2010} introduced
     71%a new idea, parallel bit scan, which provides us a more efficient
     72%scanning and better utilization of the resources.
    5473
    5574
     
    6382
    6483
    65 In this paper, We present Parabix tool chain, a novel execution framework
    66 and software runtime environment that can be used to dramatically improve
    67 the efficiency of text processing and parsing on commodity processors.
     84
     85
     86
     87
     88
     89
     90
     91
     92In this paper, we generalize parallel bitstreams and develop the
     93Parabix programming framework to help programmers build text
     94processing applications. Programmers specify the operations on
     95unbounded character lists using bitstreams in a Python environment,
     96while our code generation and runtime translate them into low-level
     97C++ routines.  The Parabix routines exploit the SIMD extensions on
     98commodity processors (SSE/AVX on x86, Neon on ARM) to process hundreds
     99of character positions in an input stream simultaneously, dramatically
     100improving the execution efficiency. We describe the overall Parabix
     101tool chain, a novel execution framework and software build environment
     102that enables text processing applications to effectively exploit
     103commodity multicores.
     104
     105
     106We apply Parabix technology to the problem of XML parsing.
    68107Figure~\ref{perf-energy} showcases the overall efficiency of our
    69 framework. The Parabix-XML parser improves the
    70 performance %by ?$\times$
    71 and energy efficiency %by ?$\times$
    72 several-fold compared
    73 to widely-used software parsers, approaching the
    74 %?$cycles/input-byte$
    75 performance of ASIC XML
    76 parsers~\cite{xmlchip,DaiNiZhu2010}.
     108framework. The Parabix-XML parser improves the performance %by ?$\times$
     109and energy efficiency %by ?$\times$
     110several-fold compared to widely-used software parsers,
     111approaching the %?$cycles/input-byte$
     112performance of ASIC XML
     113parsers~\cite{xmlchip,DaiNiZhu2010}. The Parabix-XML parser exploits
     114the bitstream technology to dramatically reduce branches in the
     115parsing routines resulting in a more efficient pipeline. It also
     116substantially improves register utilization which minimizes energy
     117wasted on cache misses and data transfers. We make the following contributions:
     118
     119
    77120\footnote{The actual energy consumption of the XML
    78121  ASIC chips is not published by the companies.}
    79122%
    80 Overall we make the following contributions:
    81123
    82 1) We outline the Parabix architecture, tool chain and runtime
    83 environment and describe how it may be used to produce efficient
    84 XML parser implementations on a variety of commodity processors.
    85 While studied in the context of XML parsing, the Parabix framework
    86 can be widely applied to many problems in text processing and
    87 parsing.  We have released Parabix completely open source
    88 and are interested in exploring the applications that can take
    89 advantage of our tool chain (\textit{http://parabix.costar.sfu.ca/}).
     1241) We outline the Parabix architecture, code-generation tool chain and
     125runtime environment and describe how it may be used to produce
     126efficient XML parser implementations on a variety of commodity
     127processors.  While studied in the context of XML parsing, the Parabix
     128framework can be widely applied to many problems in text processing
     129and parsing.  We have released Parabix completely open source and are
     130interested in exploring the applications that can take advantage of
     131our tool chain (\textit{http://parabix.costar.sfu.ca/}).
    90132
    91133
    921342) We compare the Parabix XML parser against conventional parsers and
    93135assess the improvement in overall performance and energy efficiency on
    94 variety of hardware platforms.  We are the first to compare and
     136a variety of hardware platforms.  We use Parabix to study and
    95137contrast SSE/AVX extensions across multiple generations of Intel
    96138processors and show that there are performance challenges when using
     
    99141operations.
    100142
    101 3) Finally, building on the SIMD parallelism of Parabix technology,
    102 we multithread the Parabix XML parser to enable the different
     1433) Finally, we multithread the Parabix XML parser to enable the different
    103144stages in the parser to exploit SIMD units across all the cores.
    104145This further improves performance while maintaining the energy consumption
    105 constant with the sequential version.
     146comparable with the sequential version.
    106147
    107148
     
    112153architecture, tool chain and runtime environment.
    113154Section~\ref{section:parser} describes the design of an XML parser
    114 based on the Parabix framework. Section~\ref{section:methodology} describes the evaluation framework and Section~\ref{section:baseline}
     155based on the Parabix framework. Section~\ref{section:baseline}
    115156presents a detailed performance analysis of Parabix on a
    116157\CITHREE\ system using hardware performance counters.
     
    121162AVX technology and comments on the benefits and challenges compared to
    122163the 128-bit SSE instructions.  Finally,
    123 Section~\ref{section:multithread} looks at the multithreading of the
    124 Parabix XML parser which seeks to exploit the SIMD units scattered
    125 across multiple cores.
     164Section~\ref{section:multithread} looks at multithreading to exploit
     165the SIMD units scattered across multiple cores.
    126166
    127167
  • docs/HPCA2012/final_ieee/04-methodology.tex

    r1738 r1743  
    4949\end{table}
    5050
     51\begin{table}[htbp]
     52{
     53  \footnotesize
     54  \begin{center}
     55{
     56\begin{tabular}{|l||@{~}l@{~}|@{~}l@{~}|@{~}l@{~}|}
     57\hline
     58Processor & Core2 Duo & i3-530 & Sandybridge\\ \hline
     59Frequency &  2.13GHz & 2.93GHz & 2.80GHz \\ \hline
     60L1 D Cache & 32KB & 32KB & 32KB \\ \hline       
     61L2 Cache & Shared 2MB & 256KB/core & 256KB/core \\ \hline
     62L3 Cache & --- & 4MB  & 6MB \\ \hline
     63Max TDP & 65W & 73W &  95W \\ \hline
     64\end{tabular}
     65}
     66\end{center}
     67  }
     68\caption{Platform Hardware Specs}
     69\label{hwinfo}
     70\end{table}
     71
    5172
    5273\paragraph{Platform Hardware:}
     
    7495framework.
    7596
    76 \begin{table}[htbp]
    77 \begin{center}
    78 {
    79 \begin{tabular}{|l||@{~}l@{~}|@{~}l@{~}|@{~}l@{~}|}
    80 \hline
    81 Processor & Core2 Duo & i3-530 & Sandybridge\\ \hline
    82 Frequency &  2.13GHz & 2.93GHz & 2.80GHz \\ \hline
    83 L1 D Cache & 32KB & 32KB & 32KB \\ \hline       
    84 L2 Cache & Shared 2MB & 256KB/core & 256KB/core \\ \hline
    85 L3 Cache & --- & 4MB  & 6MB \\ \hline
    86 Memory  & 2GB & 4GB & 6GB\\ \hline
    87 Max TDP & 65W & 73W &  95W \\ \hline
    88 \end{tabular}
    89 }
    90 \end{center}
    91 \caption{Platform Hardware Specs}
    92 \label{hwinfo}
    93 \end{table}
    9497
    9598
  • docs/HPCA2012/final_ieee/05-corei3.tex

    r1738 r1743  
    99%some of the numbers are roughly calculated, needs to be recalculated for final version
    1010\subsection{Cache behavior}
    11 The approximate miss penalty on the \CITHREE\ for L1, L2 and L3 caches is
    12 4, 11, and 36 cycles respectively. The L1 (32KB) and L2 cache (256KB)
    13 are private per core; L3 (4MB) is shared by all the cores.
    14 Table \ref{cache_misses} shows the cache misses per kilobyte
    15 of input data. Analytically, the cache misses for the Expat and Xerces
    16 parsers represent a 0.5 cycle per XML byte cost. This overhead
    17 does not necessarily impact the overall performance of these
    18 parsers as they experience additional overheads related to branch mispredictions.
    19 Compared to Xerces and Expat, the data organization of Parabix-XML significantly
    20 reduces the overall cache miss rate; specifically, there were $7\times$ and $15\times$
    21 fewer L1 and L2 cache misses compared to the next best parser tested. The improved cache
    22 utilization helps keep the SIMD units busy by minimizing memory-related stalls
    23 and lowers the overall energy consumption
    24 by reducing the need to access the higher levels of the cache hierarchy.
    25 Using microbenchmarks, we estimated that the L1,
    26 L2, and L3 cache misses consume $\sim$8.3nJ, $\sim$19nJ, and $\sim$40nJ
    27 respectively. On average, with a 1GB XML file, Expat and Xerces would consume over
    28 0.6J and 0.9J respectively due to cache misses alone.
    29 %With a 1GB input file, Expat would consume more than 0.6J and Xercesn
    30 %would consume 0.9J on cache misses alone.
    31 
    32 
    33 \begin{table}[htbp]
     11
     12
     13\begin{table}[!htbp]
    3414\begin{center}
    3515\begin{tabular}{|c|c|c|c|}
     
    4525\end{table}
    4626
     27
     28Table \ref{cache_misses} shows the cache misses per kilobyte of input
     29data. Analytically, the cache misses for the Expat and Xerces parsers
     30represent a 0.5 cycle per XML byte cost.\footnote{The approximate miss penalty on the \CITHREE\ for L1, L2 and L3 caches is
     314, 11, and 36 cycles respectively.}
     32
     33
     34
     35
     36This overhead does not
     37necessarily impact the overall performance of these parsers as they
     38experience additional overheads related to branch mispredictions.
     39Compared to Xerces and Expat, the data organization of Parabix-XML
     40significantly reduces the overall cache miss rate; specifically, there
     41were $7\times$ and $15\times$ fewer L1 and L2 cache misses compared to
     42the next best parser tested. The improved cache utilization helps keep
     43the SIMD units busy by minimizing memory-related stalls and lowers the
     44overall energy consumption by reducing the need to access the higher
     45levels of the cache hierarchy.  Using microbenchmarks, we estimated
     46that the L1, L2, and L3 cache misses consume $\sim$8.3nJ, $\sim$19nJ,
     47and $\sim$40nJ respectively. On average, with a 1GB XML file, Expat
     48and Xerces would consume over 0.6J and 0.9J respectively due to cache
     49misses alone.
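
The estimates above can be read through a simple linear model. This is a sketch under stated assumptions, not the authors' exact calculation: $m_i$ denotes the level-$i$ misses per kB from Table~\ref{cache_misses}, $c_i$ the miss penalty in cycles, $e_i$ the energy per miss, $S$ the input size in kB, and a kB is taken to be 1024 bytes.

```latex
\begin{align*}
  \text{miss cycles per byte} &\approx \frac{1}{1024}\sum_{i=1}^{3} m_i\,c_i,
    & (c_1, c_2, c_3) &= (4,\ 11,\ 36)\ \text{cycles},\\
  E_{\text{miss}} &\approx S \sum_{i=1}^{3} m_i\,e_i,
    & (e_1, e_2, e_3) &\approx (8.3,\ 19,\ 40)\ \text{nJ}.
\end{align*}
```

Under this model, a 1GB input is roughly $10^{6}$ kB, so totals of 0.6J and 0.9J correspond to roughly 0.6$\mu$J and 0.9$\mu$J of miss energy per kB of XML for Expat and Xerces respectively.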
     50%With a 1GB input file, Expat would consume more than 0.6J and Xercesn
     51%would consume 0.9J on cache misses alone.
     52
     53
     54
    4755\subsection{Branch Mispredictions}
    4856\label{section:XML-branches}
    49 In general, performance is limited by branch mispredictions.
    50 Unfortunately, it is difficult to reduce the branch misprediction rate of
    51 traditional XML parsers due to:
    52 (1) the variable length nature of the syntactic elements contained within XML documents;
    53 (2) a data dependent characteristic, and
    54 (3) the extensive set of syntax constraints imposed by the XML 1.0/1.1 specifications.
    5557% Branch mispredictions are known
    5658% to signficantly degrade XML parsing performance in proportion to the markup density of the source document
    5759% \cite{CameronHerdyLin2008}.
    58 As shown in Figure \ref{corei3_BR},
    59 Xerces averages up to 13 branches per XML byte processed on high density
    60 markup. On modern commodity processors the cost of a single branch
    61 misprediction is on the order of 10s of CPU cycles spent to restart the processor
    62 pipeline.
    63 
    64 The high miss prediction rate in conventional parsers is a significant overhead.
    65 In Parabix-XML, the use of SIMD operations eliminates many branches.
    66 Most conditional branches can be replaced with
    67 bitwise operations, which can process up to 128 characters worth of
    68 branches with one operation
    69 or with a series of logical predicate operations, which are cheaper
    70 to compute since they require only SIMD operations.
    71 
    72 As shown in Figure \ref{corei3_BR},
    73 Parabix-XML is nearly branch free and exhibits minimal dependence on the
    74 source markup density. Specifically, it experiences between 19.5 and
    75 30.7 branch mispredictions per kB of XML data. Conversely, the cost of
    76 branch mispredictions for the Expat parser can be over 7 cycles per
    77 XML byte (see Figure \ref{corei3_BM}) --- which exceeds
    78 the average latency of a byte processed by Parabix-XML.
    79 
    80 
    81 
    82 
    83 \begin{figure}
    84 \begin{center}
    85 {
    86 \subfigure[Branch Instructions / kB]{
    87 \includegraphics[width=0.5\textwidth]{plots/corei3_BR.pdf}
    88 \label{corei3_BR}
    89 }
    90 \hfill
    91 \subfigure[Branch Misses / kB]{
     60
     61The performance of traditional parsers is limited by their branch
     62behavior.  Xerces experiences up to 13 branches per input XML
     63character on the high markup files; Expat experiences up to 8 branches
     64per XML character.  In Parabix-XML, the use of SIMD operations
     65eliminates many branches.  Most conditional branches can be replaced
     66with bitwise operations, which can process up to 128 characters worth
     67of branches with one operation or with a series of logical predicate
     68operations, which are cheaper to compute since they require only SIMD
     69operations.
     70
     71
     72The high misprediction rate in conventional parsers is a significant
     73overhead. The cost of a single branch misprediction is on the order of
     74tens of CPU cycles spent restarting the processor pipeline.
     75Parabix-XML is nearly branch free and exhibits minimal
     76dependence on the source markup density. Specifically, it experiences
     77between 19.5 and 30.7 branch mispredictions per kB of XML
     78data. Conversely, the cost of branch mispredictions for the Expat
     79parser can be over 7 cycles per XML byte (see Figure~\ref{corei3_BM})
     80--- which exceeds the average latency of a byte processed by
     81Parabix-XML.
     82
     83Unfortunately, it is difficult to reduce the branch misprediction rate
     84of traditional XML parsers due to: (1) the variable length nature of
     85the syntactic elements contained within XML documents; (2) their input
     86data dependence; and (3) the extensive set of syntax
     87constraints imposed by the XML specifications.
     88
     89
     90
     91\begin{figure}[!h]
     92\begin{center}
     93%{
     94%\subfigure[Branch Instructions / kB]{
     95%\includegraphics[width=0.5\textwidth]{plots/corei3_BR.pdf}
     96%\label{corei3_BR}
     97%}
     98%\hfill
     99%\subfigure[Branch Misses / kB]{
    92100\includegraphics[width=0.5\textwidth]{plots/corei3_BM.pdf}
     101%}
     102\caption{Branch Mispredictions on the \CITHREE{}. (/ 1kB input)}
    93103\label{corei3_BM}
    94 }
    95 }
    96 \end{center}
    97 \caption{Branch characteristics on the \CITHREE\ per kB of input data.}
     104
     105%}
     106\end{center}
    98107\end{figure}
    99108
     
    133142% smaller.
    134143
    135 \subsection{CPU Cycles}
     144\begin{table}[htbp]
     145\begin{center}
     146{
     147\begin{tabular}{|@{~}l@{~}||@{~}l@{~}|@{~}l@{~}|@{~}l@{~}|@{~}l@{~}|@{~}l@{~}|}
     148\hline
     149File Name               & dew.xml       & jaw.xml       & roads.gml     & po.xml        & soap.xml \\ \hline   
     150SIMD                    & 81.68\%       & 80.59\%       & 70.7\%        & 66.02\%       & 59.9\%   \\ \hline   
     151Non-SIMD                & 18.32\%       & 19.41\%       & 29.3\%        & 33.98\%       & 40.1\%
     152 \\ \hline
     153\end{tabular}
     154}
     155\end{center}
     156\caption{SIMD Instruction Percentage}
     157\label{corei3_INS_p2}
     158\end{table}
     159
     160
     161
     162\subsection{Performance and Energy Characteristics}
    136163
    137164Figure \ref{corei3_TOT} shows overall parser performance in
     
    145172requires less than a single cycle per byte.
    146173
    147 \begin{table}[htbp]
    148 \begin{center}
    149 {
    150 \begin{tabular}{|@{~}l@{~}||@{~}l@{~}|@{~}l@{~}|@{~}l@{~}|@{~}l@{~}|@{~}l@{~}|}
    151 \hline
    152 File Name               & dew.xml       & jaw.xml       & roads.gml     & po.xml        & soap.xml \\ \hline   
    153 SIMD                    & 81.68\%       & 80.59\%       & 70.7\%        & 66.02\%       & 59.9\%   \\ \hline   
    154 Non-SIMD                & 18.32\%       & 19.41\%       & 29.3\%        & 33.98\%       & 40.1\%
    155  \\ \hline
    156 \end{tabular}
     174
     175 The energy results shown in Figure \ref{corei3_energy} reveal an
     176 interesting trend. Parabix consumes substantially less energy than
     177 the other parsers. Parabix consumes 50 to 75 nJ per byte while Expat
     178 and Xerces consume 80nJ to 320nJ and 140nJ to 370nJ per byte
     179 respectively. Parabix-XML experiences a minimal increase in power
     180 ($\sim5\%$) compared to the conventional parsers. While the SIMD
     181 functional units are significantly wider than the scalar
     182 counterparts, register width and functional unit power account only
     183 for a small fraction of the overall power consumption in a processor
     184 pipeline. Parabix amortizes the fetch and data access overheads over
     185 multiple data parallel operations. Although Parabix requires
     186 slightly more power (per instruction), the processing time of Parabix
     187 is significantly lower, resulting in an overall improvement in energy.
     188
     189\begin{figure*}[!htbp]
     190\begin{center}
     191\subfigure[Performance (CPU Cycles per kB)]{
     192\includegraphics[width=0.45\textwidth]{plots/corei3_TOT.pdf}
     193\label{corei3_TOT}
    157194}
    158 \end{center}
    159 \caption{SIMD Instruction Percentage}
    160 \label{corei3_INS_p2}
    161 \end{table}
    162 
    163 
    164 \begin{figure}[htbp]
    165 \begin{center}
    166 {
    167 \includegraphics[width=0.5\textwidth]{plots/corei3_TOT.pdf}
    168 }
    169 \end{center}
    170 \caption{Performance (CPU Cycles per kB)}
    171 \label{corei3_TOT}
    172 \end{figure}
    173 
    174 
    175 
    176 \subsection{Power and Energy}
    177 In this section, we study the power and energy consumption of Parabix-XML
    178 in comparison with Expat and Xerces on \CITHREE{}.
    179 Figure \ref{corei3_power} shows the
    180 average power consumed by each parser. Parabix-XML, dominated by SIMD
    181 instructions, uses $\sim5\%$ additional power. While the
    182 SIMD functional units are significantly wider than the scalar
    183 counterparts, register width and functional unit power account only
    184 for a small fraction of the overall power consumption in a processor
    185 pipeline. More importantly by using data parallel operations Parabix
    186 amortizes the fetch and data access overheads. This results in minimal
    187 power increase compared to the conventional parsers.  Perhaps the
    188 energy trends shown in Figure \ref{corei3_energy} reveal an
    189 interesting trend. Parabix consumes substantially less energy than the
    190 other parsers. Parabix consumes 50 to 75 nJ per byte while Expat and
    191 Xerces consume 80nJ to 320nJ and 140nJ to 370nJ per byte respectively.
    192 Although Parabix requires slightly more power (per instruction), the
    193 processing time of Parabix is significantly lower.
    194 
    195 
    196 \begin{figure}
    197 \begin{center}
    198 {
    199 \subfigure[Avg. Power (Watts)]{
    200 \includegraphics[width=0.5\textwidth]{plots/corei3_power.pdf}
    201 \label{corei3_power}
    202 }
    203 \hfill
    204195\subfigure[Energy Consumption ($\mu$J per kB)]{
    205 \includegraphics[width=0.5\textwidth]{plots/corei3_energy.pdf}
     196\includegraphics[width=0.45\textwidth]{plots/corei3_energy.pdf}
    206197\label{corei3_energy}
    207198}
    208 }
    209 \end{center}
    210 \caption{Power profile of Parabix on \CITHREE{}}
    211 \end{figure}
    212 
    213 
     199\caption{Performance and Energy profile of Parabix on Core i3}
     200\end{center}
     201\end{figure*}
     202
     203
  • docs/HPCA2012/final_ieee/06-scalability.tex

    r1738 r1743  
    1 \section{Evaluation of Parabix across different Hardware}
     1\section{Parabix on different platforms}
    22\label{section:scalability}
    33\subsection{Performance}
     
    3232in power consumption over the previous generation. Parabix-XML on \SB\ consumes 72\%--75\% less energy than it did on \CO{}.
    3333
    34 
    35 \begin{figure}
     34\begin{figure}[!htb]
    3635\begin{center}
    3736{
     
    4241\label{Parabix_all_platform}
    4342\end{figure}
     43
     44
     45\begin{figure*}[!htbp]
     46\begin{center}
     47{
     48\subfigure[ARM Neon Performance (cycles per kB)]{
     49\includegraphics[width=0.3\textwidth]{plots/arm_TOT.pdf}
     50\label{arm_processing_time}
     51}
     52\hfill
     53\subfigure[ARM Neon]{
     54\includegraphics[width=0.32\textwidth]{plots/Markup_density_Arm.pdf}
     55\label{relative_performance_arm}
     56}
     57\hfill
     58\subfigure[Core i3]{
     59\includegraphics[width=0.32\textwidth]{plots/Markup_density_Intel.pdf}
     60\label{relative_performance_intel}
     61}
     62}
     63\end{center}
     64\caption{Comparison of Parabix-XML on ARM vs. Intel.}
     65\end{figure*}
    4466
    4567
     
    80102of \NEON{} SIMD operations.
    81103
    82 \begin{figure*}[htbp]
    83 \begin{center}
    84 {
    85 \subfigure[ARM Neon Performance (cycles per kB)]{
    86 \includegraphics[width=0.3\textwidth]{plots/arm_TOT.pdf}
    87 \label{arm_processing_time}
    88 }
    89 \hfill
    90 \subfigure[ARM Neon]{
    91 \includegraphics[width=0.32\textwidth]{plots/Markup_density_Arm.pdf}
    92 \label{relative_performance_arm}
    93 }
    94 \hfill
    95 \subfigure[Core i3]{
    96 \includegraphics[width=0.32\textwidth]{plots/Markup_density_Intel.pdf}
    97 \label{relative_performance_intel}
    98 }
    99 }
    100 \end{center}
    101 \caption{Comparison of Parabix-XML on ARM vs. Intel.}
    102 \end{figure*}
    103 
    104104
    105105
     
    125125
    126126
     127\begin{figure*}[!htbp]
     128\begin{center}
     129\includegraphics[height=0.25\textheight]{plots/InsMix.pdf}
     130\end{center}
     131\caption{Parabix Instruction Counts (y-axis: Instructions per kB)}
     132\label{insmix}
     133\end{figure*}
    127134
    128135
     136
     137
     138
  • docs/HPCA2012/final_ieee/07-avx.tex

    r1733 r1743  
    11\section{Parabix on AVX}
     2
     3
    24\label{section:avx}
    35In this section, we discuss the scalability and performance advantages
     
    3133SIMD instruction count of Parabix on AVX.  However, in the \SB\ AVX
    3234implementation, Intel has focused primarily on floating point
    33 operations as opposed to the integer based operations.  256-bit SIMD
    34 is available for loads, stores, bitwise logic and floating operations,
    35 whereas SIMD integer operations and shifts are only available in the
    36 128-bit form.
     35operations.  256-bit SIMD is available for loads, stores, bitwise
     36logic and floating point operations, whereas SIMD integer operations and
     37shifts are only available in the 128-bit form.
    3738
    3839
     
    6263
    6364
    64 \begin{figure*}[htbp]
    65 \begin{center}
    66 \includegraphics[height=0.25\textheight]{plots/InsMix.pdf}
    67 \end{center}
    68 \caption{Parabix Instruction Counts (y-axis: Instructions per kB)}
    69 \label{insmix}
    70 \end{figure*}
    71 
    72 \begin{figure}[!h]
    73 \begin{center}
    74 \includegraphics[width=0.5\textwidth]{plots/avx.pdf}
    75 \end{center}
    76 \caption{Parabix Performance (y-axis: ns per kB)}
    77 \label{avx}
    78 \end{figure}
    79 
    8065Note that the number of non-SIMD instructions
    8166remains relatively constant across workloads.  As expected,
     
    9075reduction is also observed when Parabix-XML utilized the AVX runtime
    9176library.
     77
    9278
    9379%[AS] Check numbers.
     
    11298implementations, further performance and energy benefits
    11399could be realized in Parabix-XML.
     100
     101
     102\begin{figure}[!htb]
     103\begin{center}
     104\includegraphics[width=0.5\textwidth]{plots/avx.pdf}
     105\end{center}
     106\caption{Parabix Performance (y-axis: ns per kB)}
     107\label{avx}
     108\end{figure}
     109
  • docs/HPCA2012/final_ieee/08-arm.tex

    r1733 r1743  
    11\def\CORTEXA8{Cortex-A8}
     2
     3
     4\begin{figure}[!htb]
     5\subfigure[ARM Neon Performance]{
     6\includegraphics[width=0.5\textwidth]{plots/arm_TOT.pdf}
     7\label{arm_processing_time}
     8}
     9\hfill
     10\subfigure[Performance ARM Neon vs Core i3 SSE.]{
     11\includegraphics[width=0.5\textwidth]{plots/RelativePerformanceARMvsCoreI3.pdf}
     12\label{relative_performance_arm_vs_i3}
     13}
     14\end{figure}
     15
    216
    317\section {Parabix on Mobile Platforms}
     
    1933
    2034\subsection{Performance Results}
    21 
    22 \begin{figure}
    23 \subfigure[ARM Neon Performance]{
    24 \includegraphics[width=0.5\textwidth]{plots/arm_TOT.pdf}
    25 \label{arm_processing_time}
    26 }
    27 \hfill
    28 \subfigure[Performance ARM Neon vs Core i3 SSE.]{
    29 \includegraphics[width=0.5\textwidth]{plots/RelativePerformanceARMvsCoreI3.pdf}
    30 \label{relative_performance_arm_vs_i3}
    31 }
    32 \end{figure}
    3335
    3436Migration of Parabix2 to the Android platform began with the
  • docs/HPCA2012/final_ieee/09-pipeline.tex

    r1738 r1743  
    3030\footnotesize
    3131\begin{center}
    32 \begin{tabular}{|c|c|c|}
     32\begin{tabular}{|@{~}c@{~}|@{~}c@{~}|@{~}c@{~}|}
    3333\hline
    3434
  • docs/HPCA2012/final_ieee/10-related.tex

    r1733 r1743  
    88% Event-based SAX (Simple API for XML) parsers avoid the tree
    99% construction costs of the more flexible DOM (Document Object Model)
    10 % parsers \cite{Perkins05}.
    11 Nicola and John specifically identified the traditional method of XML
    12 parsing as a threat to database performance and outlined a number of
    13 potential directions for improving performance \cite{NicolaJohn03}.
    14 The commercial importance of XML parsing has spurred the development
    15 of numerous multi-threaded and hardware-based approaches:
    16 Multithreaded XML techniques include preparsing the XML file to locate
    17 key partitioning points~\cite{ParaDOM2009,LiWangLiuLi2009} and
    18 speculative p-DFAs~\cite{ZhangPanChiu09}. Hardware methods include
    19 custom XML chips \cite{Leventhal2009} and FPGA-based implementations
    20 \cite{DaiNiZhu2010}.  Intel's SSE4 instructions targeted
     10% parsers \cite{Perkins05}.  Nicola and John specifically identified
     11the traditional method of XML parsing as a threat to database
     12performance and outlined a number of potential directions for
     13improving performance \cite{NicolaJohn03}.  The commercial importance
     14of XML parsing has spurred the development of numerous multi-threaded
     15and hardware-based approaches: Multithreaded XML techniques include
     16preparsing the XML file to locate key partitioning
     17points~\cite{ParaDOM2009,LiWangLiuLi2009} and speculative
     18p-DFAs~\cite{ZhangPanChiu09}. Hardware methods include custom XML
     19chips \cite{Leventhal2009} and FPGA-based implementations
     20\cite{DaiNiZhu2010}. Others have explored the design of custom
     21hardware for bit parallel operations for text search in network
     22processors~\cite{tan-sherwood-isca-2005}. Intel's SSE4 instructions targeted
    2123XML parsers, but these have not seen widespread use because of portability
    2224concerns and the programming challenges that accompany low level
    23 instructions~\cite{sse4}. Recently, Cameron et
    24 al.~\cite{CameronHerdyLin2008, cameron-EuroPar2011} designed an
    25 accelerated XML parser using widely available SSE2 instructions
    26 and proposed an inductive doubling instruction set ~\cite{CameronLin2009},
    27 by which the performance can be further improved.
    28 Finally, others have explored the design of custom
    29 hardware for bit parallel operations for text search in network
    30 processors~\cite{tan-sherwood-isca-2005}.
     25instructions~\cite{sse4}.
     26
     27Recently, Cameron et al.~\cite{CameronHerdyLin2008,
     28  cameron-EuroPar2011} accelerated specific phases in an XML parser
     29using widely available SSE2 instructions and proposed an inductive
     30doubling instruction set~\cite{CameronLin2009}. In this paper, we
     31have developed a generalized Parabix architecture and have described
     32the software tool chain that programmers can use to build scalable
     33text processing applications on commodity multicores. We have explored
     34in detail the tradeoffs between the SIMD implementations across
     35processor generations (i.e., SSE vs AVX) and multiple platforms (ARM vs
     36Intel). Finally, we have also explored the benefits of pipeline parallelism.
     37
     38
     39
     40
     41
    3142
    3243
  • docs/HPCA2012/final_ieee/final.aux

    r1738 r1743  
    11\relax
    2 \citation{blake-isca-2010}
    3 \citation{esmaeilzadeh-isca-2011}
    42\citation{Asanovic:EECS-2006-183}
    5 \citation{Cameron2008}
    63\citation{xmlchip}
    7 \citation{CameronHerdyLin2008}
    8 \citation{Cameron2010}
     4\citation{Cameron2008,CameronLin2009}
     5\citation{cameron-EuroPar2011}
     6\@writefile{toc}{\contentsline {section}{\numberline {1}Introduction}{1}}
     7\@writefile{lof}{\contentsline {figure}{\numberline {1}{\ignorespaces XML Parser Technology Energy vs. Performance\relax }}{1}}
     8\providecommand*\caption@xref[2]{\@setref\relax\@undefined{#1}}
     9\newlabel{perf-energy}{{1}{1}}
    910\citation{xmlchip,DaiNiZhu2010}
    10 \@writefile{toc}{\contentsline {section}{\numberline {1}Introduction}{1}}
    1111\citation{TR:XML}
    1212\citation{xerces}
    13 \@writefile{lof}{\contentsline {figure}{\numberline {1}{\ignorespaces XML Parser Technology Energy vs. Performance\relax }}{2}}
    14 \providecommand*\caption@xref[2]{\@setref\relax\@undefined{#1}}
    15 \newlabel{perf-energy}{{1}{2}}
    1613\@writefile{toc}{\contentsline {section}{\numberline {2}Background}{2}}
    1714\newlabel{section:background}{{2}{2}}
     
    2522\@writefile{lof}{\contentsline {figure}{\numberline {3}{\ignorespaces Example 7-bit ASCII Basis Bit Streams\relax }}{3}}
    2623\newlabel{fig:BitStreamsExample}{{3}{3}}
     24\@writefile{lof}{\contentsline {figure}{\numberline {4}{\ignorespaces Lexical Parsing in Parabix\relax }}{4}}
     25\newlabel{fig:ParabixParsingExample}{{4}{4}}
    2726\@writefile{toc}{\contentsline {subsection}{\numberline {3.2}Parabix Compilers}{4}}
    2827\newlabel{parabix tool chain}{{3.2}{4}}
     
    3433\citation{expat}
    3534\citation{TR:XML}
    36 \@writefile{lof}{\contentsline {figure}{\numberline {4}{\ignorespaces Lexical Parsing in Parabix\relax }}{5}}
    37 \newlabel{fig:ParabixParsingExample}{{4}{5}}
    3835\@writefile{toc}{\contentsline {subsection}{\numberline {3.3}Parabix Runtime Libraries}{5}}
    3936\@writefile{toc}{\contentsline {section}{\numberline {4}The Parabix XML Parser}{5}}
     
    4239\newlabel{section:methodology}{{5}{5}}
    4340\newlabel{parsers}{{5}{5}}
     41\@writefile{toc}{\contentsline {paragraph}{XML Parsers:}{5}}
     42\newlabel{workloads}{{5}{5}}
     43\@writefile{toc}{\contentsline {paragraph}{XML Workloads:}{5}}
    4444\citation{bellosa2001,bertran2010}
    4545\citation{clamp}
    4646\@writefile{lof}{\contentsline {figure}{\numberline {7}{\ignorespaces Parabix XML Parser Structure\relax }}{6}}
    4747\newlabel{parabix_arch}{{7}{6}}
    48 \@writefile{toc}{\contentsline {paragraph}{XML Parsers:}{6}}
    49 \newlabel{workloads}{{5}{6}}
    50 \@writefile{toc}{\contentsline {paragraph}{XML Workloads:}{6}}
    5148\@writefile{lot}{\contentsline {table}{\numberline {1}{\ignorespaces XML Document Characteristics\relax }}{6}}
    5249\newlabel{XMLDocChars}{{1}{6}}
    53 \@writefile{toc}{\contentsline {paragraph}{Platform Hardware:}{6}}
    5450\@writefile{lot}{\contentsline {table}{\numberline {2}{\ignorespaces Platform Hardware Specs\relax }}{6}}
    5551\newlabel{hwinfo}{{2}{6}}
     52\@writefile{toc}{\contentsline {paragraph}{Platform Hardware:}{6}}
    5653\@writefile{toc}{\contentsline {paragraph}{Energy Measurement:}{6}}
    57 \@writefile{toc}{\contentsline {section}{\numberline {6}Efficiency of the Parabix-XML Parser}{7}}
    58 \newlabel{section:baseline}{{6}{7}}
    59 \@writefile{toc}{\contentsline {subsection}{\numberline {6.1}Cache behavior}{7}}
    60 \@writefile{lot}{\contentsline {table}{\numberline {3}{\ignorespaces Cache Misses per kB of input data\relax }}{7}}
    61 \newlabel{cache_misses}{{3}{7}}
     54\@writefile{toc}{\contentsline {section}{\numberline {6}Efficiency of the Parabix-XML Parser}{6}}
     55\newlabel{section:baseline}{{6}{6}}
     56\@writefile{toc}{\contentsline {subsection}{\numberline {6.1}Cache behavior}{6}}
     57\@writefile{lot}{\contentsline {table}{\numberline {3}{\ignorespaces Cache Misses per kB of input data\relax }}{6}}
     58\newlabel{cache_misses}{{3}{6}}
    6259\@writefile{toc}{\contentsline {subsection}{\numberline {6.2}Branch Mispredictions}{7}}
    6360\newlabel{section:XML-branches}{{6.2}{7}}
    64 \newlabel{corei3_BR}{{8(a)}{7}}
    65 \newlabel{sub@corei3_BR}{{(a)}{7}}
    66 \newlabel{corei3_BM}{{8(b)}{7}}
    67 \newlabel{sub@corei3_BM}{{(b)}{7}}
    68 \@writefile{lof}{\contentsline {figure}{\numberline {8}{\ignorespaces Branch characteristics on the Core-i3\ per kB of input data.\relax }}{7}}
    69 \@writefile{lof}{\contentsline {subfigure}{\numberline{(a)}{\ignorespaces {Branch Instructions / kB}}}{7}}
    70 \@writefile{lof}{\contentsline {subfigure}{\numberline{(b)}{\ignorespaces {Branch Misses / kB}}}{7}}
     61\@writefile{lof}{\contentsline {figure}{\numberline {8}{\ignorespaces Branch Mispredictions on the Core-i3{}. (/ 1kB input)\relax }}{7}}
     62\newlabel{corei3_BM}{{8}{7}}
    7163\@writefile{toc}{\contentsline {subsection}{\numberline {6.3}SIMD Instructions vs. Total Instructions}{7}}
    72 \@writefile{toc}{\contentsline {subsection}{\numberline {6.4}CPU Cycles}{7}}
    73 \@writefile{lot}{\contentsline {table}{\numberline {4}{\ignorespaces SIMD Instruction Percentage\relax }}{8}}
    74 \newlabel{corei3_INS_p2}{{4}{8}}
    75 \@writefile{lof}{\contentsline {figure}{\numberline {9}{\ignorespaces Performance (CPU Cycles per kB)\relax }}{8}}
    76 \newlabel{corei3_TOT}{{9}{8}}
    77 \@writefile{toc}{\contentsline {subsection}{\numberline {6.5}Power and Energy}{8}}
    78 \@writefile{toc}{\contentsline {section}{\numberline {7}Evaluation of Parabix across different Hardware}{8}}
    79 \newlabel{section:scalability}{{7}{8}}
    80 \@writefile{toc}{\contentsline {subsection}{\numberline {7.1}Performance}{8}}
    81 \newlabel{section:scalability:intel}{{7.1}{8}}
    82 \newlabel{corei3_power}{{10(a)}{8}}
    83 \newlabel{sub@corei3_power}{{(a)}{8}}
    84 \newlabel{corei3_energy}{{10(b)}{8}}
     64\@writefile{lot}{\contentsline {table}{\numberline {4}{\ignorespaces SIMD Instruction Percentage\relax }}{7}}
     65\newlabel{corei3_INS_p2}{{4}{7}}
     66\@writefile{toc}{\contentsline {subsection}{\numberline {6.4}Performance and Energy Characteristics}{7}}
     67\@writefile{toc}{\contentsline {section}{\numberline {7}Parabix on different platforms}{7}}
     68\newlabel{section:scalability}{{7}{7}}
     69\@writefile{toc}{\contentsline {subsection}{\numberline {7.1}Performance}{7}}
     70\newlabel{section:scalability:intel}{{7.1}{7}}
     71\newlabel{corei3_TOT}{{9(a)}{8}}
     72\newlabel{sub@corei3_TOT}{{(a)}{8}}
     73\newlabel{corei3_energy}{{9(b)}{8}}
    8574\newlabel{sub@corei3_energy}{{(b)}{8}}
    86 \@writefile{lof}{\contentsline {figure}{\numberline {10}{\ignorespaces Power profile of Parabix on Core-i3{}\relax }}{8}}
    87 \@writefile{lof}{\contentsline {subfigure}{\numberline{(a)}{\ignorespaces {Avg. Power (Watts)}}}{8}}
     75\@writefile{lof}{\contentsline {figure}{\numberline {9}{\ignorespaces Performance and Energy profile of Parabix on Core i3\relax }}{8}}
     76\@writefile{lof}{\contentsline {subfigure}{\numberline{(a)}{\ignorespaces {Performance (CPU Cycles per kB)}}}{8}}
    8877\@writefile{lof}{\contentsline {subfigure}{\numberline{(b)}{\ignorespaces {Energy Consumption ($\mu $J per kB)}}}{8}}
     78\@writefile{lof}{\contentsline {figure}{\numberline {10}{\ignorespaces Parabix on various hardware platforms\relax }}{8}}
     79\newlabel{Parabix_all_platform}{{10}{8}}
    8980\@writefile{toc}{\contentsline {subsection}{\numberline {7.2}Parabix on Mobile processors}{8}}
    9081\newlabel{section:scalability:Neon{}}{{7.2}{8}}
    91 \@writefile{lof}{\contentsline {figure}{\numberline {11}{\ignorespaces Parabix on various hardware platforms\relax }}{9}}
    92 \newlabel{Parabix_all_platform}{{11}{9}}
     82\newlabel{arm_processing_time}{{11(a)}{9}}
     83\newlabel{sub@arm_processing_time}{{(a)}{9}}
     84\newlabel{relative_performance_arm}{{11(b)}{9}}
     85\newlabel{sub@relative_performance_arm}{{(b)}{9}}
     86\newlabel{relative_performance_intel}{{11(c)}{9}}
     87\newlabel{sub@relative_performance_intel}{{(c)}{9}}
     88\@writefile{lof}{\contentsline {figure}{\numberline {11}{\ignorespaces Comparison of Parabix-XML on ARM vs. Intel.\relax }}{9}}
     89\@writefile{lof}{\contentsline {subfigure}{\numberline{(a)}{\ignorespaces {ARM Neon Performance (cycles per kB)}}}{9}}
     90\@writefile{lof}{\contentsline {subfigure}{\numberline{(b)}{\ignorespaces {ARM Neon}}}{9}}
     91\@writefile{lof}{\contentsline {subfigure}{\numberline{(c)}{\ignorespaces {Core i3}}}{9}}
    9392\@writefile{toc}{\contentsline {section}{\numberline {8}Parabix on AVX}{9}}
    9493\newlabel{section:avx}{{8}{9}}
     
    9897\citation{dataparallel}
    9998\citation{Shah:2009}
    100 \newlabel{arm_processing_time}{{12(a)}{10}}
    101 \newlabel{sub@arm_processing_time}{{(a)}{10}}
    102 \newlabel{relative_performance_arm}{{12(b)}{10}}
    103 \newlabel{sub@relative_performance_arm}{{(b)}{10}}
    104 \newlabel{relative_performance_intel}{{12(c)}{10}}
    105 \newlabel{sub@relative_performance_intel}{{(c)}{10}}
    106 \@writefile{lof}{\contentsline {figure}{\numberline {12}{\ignorespaces Comparison of Parabix-XML on ARM vs. Intel.\relax }}{10}}
    107 \@writefile{lof}{\contentsline {subfigure}{\numberline{(a)}{\ignorespaces {ARM Neon Performance (cycles per kB)}}}{10}}
    108 \@writefile{lof}{\contentsline {subfigure}{\numberline{(b)}{\ignorespaces {ARM Neon}}}{10}}
    109 \@writefile{lof}{\contentsline {subfigure}{\numberline{(c)}{\ignorespaces {Core i3}}}{10}}
    110 \@writefile{lof}{\contentsline {figure}{\numberline {14}{\ignorespaces Parabix Performance (y-axis: ns per kB)\relax }}{10}}
    111 \newlabel{avx}{{14}{10}}
     99\@writefile{lof}{\contentsline {figure}{\numberline {12}{\ignorespaces Parabix Instruction Counts (y-axis: Instructions per kB)\relax }}{10}}
     100\newlabel{insmix}{{12}{10}}
     101\@writefile{lof}{\contentsline {figure}{\numberline {13}{\ignorespaces Parabix Performance (y-axis: ns per kB)\relax }}{10}}
     102\newlabel{avx}{{13}{10}}
    112103\@writefile{toc}{\contentsline {section}{\numberline {9}Multithreaded Parabix}{10}}
    113104\newlabel{section:multithread}{{9}{10}}
     105\@writefile{lot}{\contentsline {table}{\numberline {5}{\ignorespaces Stage Division\relax }}{10}}
     106\newlabel{pass_structure}{{5}{10}}
    114107\citation{DaiNiZhu2010}
    115108\citation{NicolaJohn03}
     
    118111\citation{Leventhal2009}
    119112\citation{DaiNiZhu2010}
     113\citation{tan-sherwood-isca-2005}
    120114\citation{sse4}
    121115\citation{CameronHerdyLin2008,cameron-EuroPar2011}
    122116\citation{CameronLin2009}
    123 \citation{tan-sherwood-isca-2005}
    124 \@writefile{lof}{\contentsline {figure}{\numberline {13}{\ignorespaces Parabix Instruction Counts (y-axis: Instructions per kB)\relax }}{11}}
    125 \newlabel{insmix}{{13}{11}}
    126 \@writefile{lot}{\contentsline {table}{\numberline {5}{\ignorespaces Stage Division\relax }}{11}}
    127 \newlabel{pass_structure}{{5}{11}}
    128 \@writefile{lof}{\contentsline {figure}{\numberline {15}{\ignorespaces Average Statistic of Multithreaded Parabix\relax }}{11}}
    129 \newlabel{multithread_perf}{{15}{11}}
     117\@writefile{lof}{\contentsline {figure}{\numberline {14}{\ignorespaces Average Statistic of Multithreaded Parabix\relax }}{11}}
     118\newlabel{multithread_perf}{{14}{11}}
    130119\@writefile{toc}{\contentsline {section}{\numberline {10}Related Work}{11}}
    131120\newlabel{section:related}{{10}{11}}
     121\@writefile{toc}{\contentsline {section}{\numberline {11}Conclusion}{11}}
     122\newlabel{section:conclusion}{{11}{11}}
    132123\bibstyle{ieee/latex8}
    133124\bibdata{reference}
     
    150141\bibcite{Leventhal2009}{17}
    151142\bibcite{xmlchip}{18}
    152 \@writefile{toc}{\contentsline {section}{\numberline {11}Conclusion}{12}}
    153 \newlabel{section:conclusion}{{11}{12}}
    154143\bibcite{LiWangLiuLi2009}{19}
    155144\bibcite{dataparallel}{20}
  • docs/HPCA2012/final_ieee/final.log

    r1738 r1743  
    1 This is pdfTeX, Version 3.1415926-1.40.10 (TeX Live 2009/Debian) (format=pdflatex 2011.4.5)  24 NOV 2011 11:16
     1This is pdfTeX, Version 3.1415926-1.40.10 (TeX Live 2009/Debian) (format=pdflatex 2011.10.12)  29 NOV 2011 17:45
    22entering extended mode
    33 %&-line parsing enabled.
     
    66LaTeX2e <2009/09/24>
    77Babel <v3.8l> and hyphenation patterns for english, usenglishmax, dumylang, noh
    8 yphenation, loaded.
     8yphenation, farsi, arabic, croatian, bulgarian, ukrainian, russian, czech, slov
     9ak, danish, dutch, finnish, french, basque, ngerman, german, german-x-2009-06-1
     109, ngerman-x-2009-06-19, ibycus, monogreek, greek, ancientgreek, hungarian, san
     11skrit, italian, latin, latvian, lithuanian, mongolian2a, mongolian, bokmal, nyn
     12orsk, romanian, irish, coptic, serbian, turkish, welsh, esperanto, uppersorbian
     13, estonian, indonesian, interlingua, icelandic, kurmanji, slovenian, polish, po
     14rtuguese, spanish, galician, catalan, swedish, ukenglish, pinyin, loaded.
    915(./preamble-final-ieee.tex (/usr/share/texmf-texlive/tex/latex/base/article.cls
    1016Document Class: article 2007/10/19 v1.4h Standard LaTeX document class
     
    110116
    111117(/usr/share/texmf-texlive/tex/latex/pdftex-def/pdftex.def
    112 File: pdftex.def 2010/03/12 v0.04p Graphics/color for pdfTeX
     118File: pdftex.def 2009/08/25 v0.04m Graphics/color for pdfTeX
    113119\Gread@gobject=\count99
    114120))
     
    180186See the caption package documentation for explanation.
    181187
    182 Package caption Info: \@makecaption = \long macro:#1#2-> \vskip 10pt \setbox \@
     188Package caption Info: \@makecaption = \long macro:#1#2-> \vskip 40pt \setbox \@
    183189tempboxa \hbox {\tenhv \noindent #1.~#2} \setlength {\@ctmp }{\hsize } \addtole
    184190ngth {\@ctmp }{-\@figindent }\addtolength {\@ctmp }{-\@figindent } \ifdim \wd \
     
    304310)
    305311LaTeX Info: Redefining \= on input line 18.
    306 LaTeX Info: Redefining \underscore on input line 185.
    307 LaTeX Info: Redefining \code on input line 201.
     312LaTeX Info: Redefining \underscore on input line 186.
     313LaTeX Info: Redefining \code on input line 202.
    308314
    309315
     
    314320\openout1 = `final.aux'.
    315321
    316 LaTeX Font Info:    Checking defaults for OML/cmm/m/it on input line 227.
    317 LaTeX Font Info:    ... okay on input line 227.
    318 LaTeX Font Info:    Checking defaults for T1/cmr/m/n on input line 227.
    319 LaTeX Font Info:    ... okay on input line 227.
    320 LaTeX Font Info:    Checking defaults for OT1/cmr/m/n on input line 227.
    321 LaTeX Font Info:    ... okay on input line 227.
    322 LaTeX Font Info:    Checking defaults for OMS/cmsy/m/n on input line 227.
    323 LaTeX Font Info:    ... okay on input line 227.
    324 LaTeX Font Info:    Checking defaults for OMX/cmex/m/n on input line 227.
    325 LaTeX Font Info:    ... okay on input line 227.
    326 LaTeX Font Info:    Checking defaults for U/cmr/m/n on input line 227.
    327 LaTeX Font Info:    ... okay on input line 227.
    328 LaTeX Font Info:    Try loading font information for OT1+ptm on input line 227.
     322LaTeX Font Info:    Checking defaults for OML/cmm/m/it on input line 228.
     323LaTeX Font Info:    ... okay on input line 228.
     324LaTeX Font Info:    Checking defaults for T1/cmr/m/n on input line 228.
     325LaTeX Font Info:    ... okay on input line 228.
     326LaTeX Font Info:    Checking defaults for OT1/cmr/m/n on input line 228.
     327LaTeX Font Info:    ... okay on input line 228.
     328LaTeX Font Info:    Checking defaults for OMS/cmsy/m/n on input line 228.
     329LaTeX Font Info:    ... okay on input line 228.
     330LaTeX Font Info:    Checking defaults for OMX/cmex/m/n on input line 228.
     331LaTeX Font Info:    ... okay on input line 228.
     332LaTeX Font Info:    Checking defaults for U/cmr/m/n on input line 228.
     333LaTeX Font Info:    ... okay on input line 228.
     334LaTeX Font Info:    Try loading font information for OT1+ptm on input line 228.
    329335
    330336 (/usr/share/texmf-texlive/tex/latex/psnfss/ot1ptm.fd
     
    350356Package caption Info: End \AtBeginDocument code.
    351357LaTeX Font Info:    Font shape `OT1/ptm/bx/n' in size <14.4> not available
    352 (Font)              Font shape `OT1/ptm/b/n' tried instead on input line 245.
     358(Font)              Font shape `OT1/ptm/b/n' tried instead on input line 246.
    353359LaTeX Font Info:    Try loading font information for OT1+ztmcm on input line 24
    354 5.
     3606.
    355361 (/usr/share/texmf-texlive/tex/latex/psnfss/ot1ztmcm.fd
    356362File: ot1ztmcm.fd 2000/01/03 Fontinst v1.801 font definitions for OT1/ztmcm.
    357363)
    358364LaTeX Font Info:    Try loading font information for OML+ztmcm on input line 24
    359 5.
     3656.
    360366
    361367(/usr/share/texmf-texlive/tex/latex/psnfss/omlztmcm.fd
     
    363369)
    364370LaTeX Font Info:    Try loading font information for OMS+ztmcm on input line 24
    365 5.
     3716.
    366372
    367373(/usr/share/texmf-texlive/tex/latex/psnfss/omsztmcm.fd
     
    369375)
    370376LaTeX Font Info:    Try loading font information for OMX+ztmcm on input line 24
    371 5.
     3776.
    372378
    373379(/usr/share/texmf-texlive/tex/latex/psnfss/omxztmcm.fd
     
    375381)
    376382LaTeX Font Info:    Font shape `OT1/ptm/bx/n' in size <12> not available
    377 (Font)              Font shape `OT1/ptm/b/n' tried instead on input line 245.
     383(Font)              Font shape `OT1/ptm/b/n' tried instead on input line 246.
    378384LaTeX Font Info:    Font shape `OT1/ptm/bx/n' in size <9> not available
    379 (Font)              Font shape `OT1/ptm/b/n' tried instead on input line 245.
     385(Font)              Font shape `OT1/ptm/b/n' tried instead on input line 246.
    380386LaTeX Font Info:    Font shape `OT1/ptm/bx/n' in size <7> not available
    381 (Font)              Font shape `OT1/ptm/b/n' tried instead on input line 245.
    382 LaTeX Font Info:    Try loading font information for OMS+ptm on input line 245.
     387(Font)              Font shape `OT1/ptm/b/n' tried instead on input line 246.
     388LaTeX Font Info:    Try loading font information for OMS+ptm on input line 246.
    383389
    384390
     
    387393)
    388394LaTeX Font Info:    Font shape `OMS/ptm/m/n' in size <12> not available
    389 (Font)              Font shape `OMS/cmsy/m/n' tried instead on input line 245.
     395(Font)              Font shape `OMS/cmsy/m/n' tried instead on input line 246.
    390396 (./00-abstract.tex
    391397LaTeX Font Info:    Font shape `OT1/ptm/bx/n' in size <10> not available
    392 (Font)              Font shape `OT1/ptm/b/n' tried instead on input line 64.
     398(Font)              Font shape `OT1/ptm/b/n' tried instead on input line 67.
    393399LaTeX Font Info:    Font shape `OT1/ptm/bx/n' in size <7.4> not available
    394 (Font)              Font shape `OT1/ptm/b/n' tried instead on input line 64.
     400(Font)              Font shape `OT1/ptm/b/n' tried instead on input line 67.
    395401LaTeX Font Info:    Font shape `OT1/ptm/bx/n' in size <6> not available
    396 (Font)              Font shape `OT1/ptm/b/n' tried instead on input line 64.
     402(Font)              Font shape `OT1/ptm/b/n' tried instead on input line 67.
    397403)
    398404(./01-intro.tex
     
    400406File: plots/performance_energy_chart.pdf Graphic file (type pdf)
    401407
    402 <use plots/performance_energy_chart.pdf>
     408<use plots/performance_energy_chart.pdf> [1{/var/lib/texmf/fonts/map/pdftex/upd
     409map/pdftex.map}
     410
     411
     412 <./plots/performance_energy_chart.pdf>]
    403413LaTeX Font Info:    Font shape `OT1/ptm/bx/n' in size <8> not available
    404 (Font)              Font shape `OT1/ptm/b/n' tried instead on input line 78.
     414(Font)              Font shape `OT1/ptm/b/n' tried instead on input line 121.
    405415LaTeX Font Info:    Font shape `OT1/ptm/bx/n' in size <5> not available
    406 (Font)              Font shape `OT1/ptm/b/n' tried instead on input line 78.
    407  [1{/var/lib/texmf/fonts/map/pdftex/updmap/pdftex.map}
    408 
    409 
    410 ]) (./02-background.tex
     416(Font)              Font shape `OT1/ptm/b/n' tried instead on input line 121.
     417) (./02-background.tex
    411418LaTeX Font Info:    Try loading font information for OT1+pcr on input line 55.
    412419
     
    416423LaTeX Font Info:    Font shape `OT1/pcr/bx/n' in size <8> not available
    417424(Font)              Font shape `OT1/pcr/b/n' tried instead on input line 56.
    418  [2 <./plots/performance_energy_chart.pdf>]) (./03-research.tex
     425 [2]) (./03-research.tex
    419426Overfull \hbox (3.99174pt too wide) in paragraph at lines 32--37
    420427 []
     
    451458
    452459[5]
    453 Underfull \hbox (badness 1286) in paragraph at lines 98--111
     460Underfull \hbox (badness 1286) in paragraph at lines 101--114
    454461[] \OT1/ptm/b/n/10 En-ergy Mea-sure-ment:[] \OT1/ptm/m/n/10 A key ben-e-fit of
    455462the Para-bix
     
    457464
    458465) (./05-corei3.tex [6 <./plots/parabix_arch.pdf>]
    459 <plots/corei3_BR.pdf, id=68, 454.69875pt x 206.7725pt>
    460 File: plots/corei3_BR.pdf Graphic file (type pdf)
    461 
    462 <use plots/corei3_BR.pdf>
    463 <plots/corei3_BM.pdf, id=70, 440.64626pt x 202.7575pt>
     466<plots/corei3_BM.pdf, id=68, 440.64626pt x 202.7575pt>
    464467File: plots/corei3_BM.pdf Graphic file (type pdf)
    465468
    466469<use plots/corei3_BM.pdf>
    467 Overfull \hbox (12.22688pt too wide) in paragraph at lines 89--96
    468  []
    469  []
    470 
    471 
    472 Overfull \hbox (12.22688pt too wide) in paragraph at lines 89--96
    473  []
    474  []
    475 
    476 [7 <./plots/corei3_BR.pdf> <./plots/corei3_BM.pdf>]
    477 Overfull \hbox (7.49034pt too wide) in paragraph at lines 150--158
    478  []
    479  []
    480 
    481 <plots/corei3_TOT.pdf, id=104, 457.71pt x 209.78375pt>
     470Overfull \hbox (7.22688pt too wide) in paragraph at lines 100--102
     471 []
     472 []
     473
     474
     475Overfull \hbox (7.49034pt too wide) in paragraph at lines 147--155
     476 []
     477 []
     478
     479<plots/corei3_TOT.pdf, id=70, 457.71pt x 209.78375pt>
    482480File: plots/corei3_TOT.pdf Graphic file (type pdf)
    483481
    484482<use plots/corei3_TOT.pdf>
    485 Overfull \hbox (7.22688pt too wide) in paragraph at lines 167--169
    486  []
    487  []
    488 
    489 <plots/corei3_power.pdf, id=106, 451.6875pt x 208.78pt>
    490 File: plots/corei3_power.pdf Graphic file (type pdf)
    491 
    492 <use plots/corei3_power.pdf>
    493 <plots/corei3_energy.pdf, id=108, 454.69875pt x 203.76125pt>
     483<plots/corei3_energy.pdf, id=72, 454.69875pt x 203.76125pt>
    494484File: plots/corei3_energy.pdf Graphic file (type pdf)
    495485
    496 <use plots/corei3_energy.pdf>
    497 Overfull \hbox (12.22688pt too wide) in paragraph at lines 202--209
    498  []
    499  []
    500 
    501 
    502 Overfull \hbox (12.22688pt too wide) in paragraph at lines 202--209
    503  []
    504  []
    505 
    506 ) (./06-scalability.tex
    507 <plots/Parabix2_all_platform.pdf, id=110, 432.61626pt x 263.98625pt>
     486<use plots/corei3_energy.pdf>) (./06-scalability.tex [7 <./plots/corei3_BM.pdf>
     487] <plots/Parabix2_all_platform.pdf, id=92, 432.61626pt x 263.98625pt>
    508488File: plots/Parabix2_all_platform.pdf Graphic file (type pdf)
    509489
    510490<use plots/Parabix2_all_platform.pdf>
    511 Overfull \hbox (7.22688pt too wide) in paragraph at lines 38--40
    512  []
    513  []
    514 
    515 [8 <./plots/corei3_TOT.pdf> <./plots/corei3_power.pdf> <./plots/corei3_energy.p
    516 df>] <plots/arm_TOT.pdf, id=157, 424.58624pt x 283.0575pt>
     491Overfull \hbox (7.22688pt too wide) in paragraph at lines 37--39
     492 []
     493 []
     494
     495<plots/arm_TOT.pdf, id=93, 424.58624pt x 283.0575pt>
    517496File: plots/arm_TOT.pdf Graphic file (type pdf)
    518 
    519 <use plots/arm_TOT.pdf>
    520 <plots/Markup_density_Arm.pdf, id=159, 369.38pt x 252.945pt>
     497 <use plots/arm_TOT.pdf>
     498<plots/Markup_density_Arm.pdf, id=95, 369.38pt x 252.945pt>
    521499File: plots/Markup_density_Arm.pdf Graphic file (type pdf)
    522500
    523501<use plots/Markup_density_Arm.pdf>
    524 <plots/Markup_density_Intel.pdf, id=161, 370.38374pt x 252.945pt>
     502<plots/Markup_density_Intel.pdf, id=97, 370.38374pt x 252.945pt>
    525503File: plots/Markup_density_Intel.pdf Graphic file (type pdf)
    526504
    527 <use plots/Markup_density_Intel.pdf>) (./07-avx.tex [9 <./plots/Parabix2_all_pl
    528 atform.pdf>] <plots/InsMix.pdf, id=190, 744.7825pt x 261.97874pt>
     505<use plots/Markup_density_Intel.pdf> [8 <./plots/corei3_TOT.pdf> <./plots/corei
     5063_energy.pdf> <./plots/Parabix2_all_platform.pdf>]
     507<plots/InsMix.pdf, id=155, 744.7825pt x 261.97874pt>
    529508File: plots/InsMix.pdf Graphic file (type pdf)
    530 
    531 <use plots/InsMix.pdf> <plots/avx.pdf, id=191, 424.58624pt x 212.795pt>
     509 <use plots/InsMix.pdf>)
     510(./07-avx.tex [9 <./plots/arm_TOT.pdf> <./plots/Markup_density_Arm.pdf> <./plot
     511s/Markup_density_Intel.pdf>] <plots/avx.pdf, id=186, 424.58624pt x 212.795pt>
    532512File: plots/avx.pdf Graphic file (type pdf)
    533513
    534514<use plots/avx.pdf>
    535 Overfull \hbox (7.22688pt too wide) in paragraph at lines 74--75
    536  []
    537  []
    538 
    539 ) (./09-pipeline.tex [10 <./plots/arm_TOT.pdf> <./plots/Markup_density_Arm.pdf>
    540  <./plots/Markup_density_Intel.pdf> <./plots/avx.pdf>]
    541 Overfull \hbox (9.70384pt too wide) in paragraph at lines 32--41
    542  []
    543  []
    544 
    545 
     515Overfull \hbox (7.22688pt too wide) in paragraph at lines 104--105
     516 []
     517 []
     518
     519) (./09-pipeline.tex [10 <./plots/InsMix.pdf> <./plots/avx.pdf>]
    546520Underfull \hbox (badness 1072) in paragraph at lines 76--85
    547 []\OT1/ptm/m/n/10 Figure 15[] demon-strates the per-for-mance im-prove-ment
    548  []
    549 
    550 <plots/pipeline.pdf, id=237, 471.7625pt x 275.0275pt>
     521[]\OT1/ptm/m/n/10 Figure 14[] demon-strates the per-for-mance im-prove-ment
     522 []
     523
     524<plots/pipeline.pdf, id=219, 471.7625pt x 275.0275pt>
    551525File: plots/pipeline.pdf Graphic file (type pdf)
    552526 <use plots/pipeline.pdf>
     
    555529 []
    556530
    557 ) (./10-related.tex [11 <./plots/InsMix.pdf> <./plots/pipeline.pdf>])
    558 (./11-conclusions.tex) (./final.bbl
     531) (./10-related.tex) (./11-conclusions.tex) [11 <./plots/pipeline.pdf>]
     532(./final.bbl
    559533Underfull \hbox (badness 1137) in paragraph at lines 17--22
    560534[]\OT1/ptm/m/n/9 R. Bertran, M. Gon-za-lez, X. Mar-torell, N. Navarro, and
     
    576550 []
    577551
    578 [12]
    579552Missing character: There is no à in font ptmr7t!
    580553Missing character: There is no š in font ptmr7t!
    581 ) [13
    582 
    583 ] (./final.aux) )
     554) [12] (./final.aux) )
    584555Here is how much of TeX's memory you used:
    585  3946 strings out of 495061
    586  55240 string characters out of 1182622
    587  121364 words of memory out of 3000000
    588  6953 multiletter control sequences out of 15000+50000
    589  68455 words of font info for 164 fonts, out of 3000000 for 9000
    590  31 hyphenation exceptions out of 8191
     556 3933 strings out of 493848
     557 54924 string characters out of 1152822
     558 121025 words of memory out of 3000000
     559 7038 multiletter control sequences out of 15000+50000
     560 69892 words of font info for 168 fonts, out of 3000000 for 9000
     561 717 hyphenation exceptions out of 8191
    591562 38i,12n,38p,1456b,370s stack positions out of 5000i,500n,10000p,200000b,50000s
    592 {/usr/share/texmf-texlive/fonts/enc/dvips/base/8r.enc}</usr/sh
    593 are/texmf-texlive/fonts/type1/public/amsfonts/cm/cmmi10.pfb></usr/share/texmf-t
    594 exlive/fonts/type1/public/amsfonts/cm/cmr10.pfb></usr/share/texmf-texlive/fonts
    595 /type1/public/amsfonts/cm/cmsy10.pfb></usr/share/texmf-texlive/fonts/type1/publ
    596 ic/amsfonts/cm/cmtt10.pfb></usr/share/texmf-texlive/fonts/type1/public/amsfonts
    597 /cm/cmtt8.pfb></usr/share/texmf-texlive/fonts/type1/urw/courier/ucrb8a.pfb></us
    598 r/share/texmf-texlive/fonts/type1/urw/courier/ucrr8a.pfb></usr/share/texmf-texl
    599 ive/fonts/type1/urw/symbol/usyr.pfb></usr/share/texmf-texlive/fonts/type1/urw/s
    600 ymbol/usyr.pfb></usr/share/texmf-texlive/fonts/type1/urw/times/utmb8a.pfb></usr
    601 /share/texmf-texlive/fonts/type1/urw/times/utmr8a.pfb></usr/share/texmf-texlive
    602 /fonts/type1/urw/times/utmri8a.pfb>
    603 Output written on final.pdf (13 pages, 553842 bytes).
     563{/usr/share/texmf-texlive/fonts/enc/dvips/base/8r.enc}</u
     564sr/share/texmf-texlive/fonts/type1/public/amsfonts/cm/cmmi10.pfb></usr/share/te
     565xmf-texlive/fonts/type1/public/amsfonts/cm/cmr10.pfb></usr/share/texmf-texlive/
     566fonts/type1/public/amsfonts/cm/cmsy10.pfb></usr/share/texmf-texlive/fonts/type1
     567/public/amsfonts/cm/cmtt10.pfb></usr/share/texmf-texlive/fonts/type1/public/ams
     568fonts/cm/cmtt8.pfb></usr/share/texmf-texlive/fonts/type1/urw/courier/ucrb8a.pfb
     569></usr/share/texmf-texlive/fonts/type1/urw/courier/ucrr8a.pfb></usr/share/texmf
     570-texlive/fonts/type1/urw/symbol/usyr.pfb></usr/share/texmf-texlive/fonts/type1/
     571urw/symbol/usyr.pfb></usr/share/texmf-texlive/fonts/type1/urw/times/utmb8a.pfb>
     572</usr/share/texmf-texlive/fonts/type1/urw/times/utmr8a.pfb></usr/share/texmf-te
     573xlive/fonts/type1/urw/times/utmri8a.pfb>
     574Output written on final.pdf (12 pages, 517731 bytes).
    604575PDF statistics:
    605  311 PDF objects out of 1000 (max. 8388607)
     576 275 PDF objects out of 1000 (max. 8388607)
    606577 0 named destinations out of 1000 (max. 500000)
    607  71 words of extra memory for PDF output out of 10000 (max. 10000000)
    608 
     578 61 words of extra memory for PDF output out of 10000 (max. 10000000)
     579
  • docs/HPCA2012/final_ieee/final.tex

    r1737 r1743  
    2828%\renewcommand{\bottomfraction}{.5}      % instead of .3
    2929\setcounter{topnumber}{3}       % allow lots of floats at top of page
    30 \addtolength{\abovecaptionskip}{-10pt} %reduce space above captions
     30\addtolength{\abovecaptionskip}{-5pt} %reduce space above captions
     31\addtolength{\belowcaptionskip}{-5pt} %reduce space below captions
    3132
    3233% reduce space before \paragraph:
     
    266267
    267268\section*{Acknowledgment}
    268 The authors would like to thank...
     269We would like to thank the anonymous reviewers and our shepherd,
     270Martha Kim, for suggestions and feedback that helped to improve this
     271paper.
    269272
    270273% tighten spacing:
    271274\let\oldthebibliography\thebibliography
    272 \def\thebibliography#1{\oldthebibliography{#1}\parsep3pt\itemsep-1pt}
     275\def\thebibliography#1{\oldthebibliography{#1}\parsep5pt\itemsep0pt}
    273276{
    274  \footnotesize
     277\setstretch{1}
     278\footnotesize
    275279\bibliographystyle{ieee/latex8}
    276280 \bibliography{reference}
  • docs/HPCA2012/final_ieee/ieee/latex8.sty

    r1737 r1743  
    141141
    142142\long\def\@makecaption#1#2{
    143    \vskip 10pt
     143   \vskip 40pt
    144144   \setbox\@tempboxa\hbox{\tenhv\noindent #1.~#2}
    145145   \setlength{\@ctmp}{\hsize}
  • docs/HPCA2012/final_ieee/preamble-submit.tex

    r1733 r1743  
    11\documentclass[12pt,letterpaper]{article}
    22\usepackage{setspace}
    3 \usepackage{latex/iccv}
    43\usepackage{fullpage}
    54\doublespacing