Changeset 1774 for docs

Timestamp:
Dec 13, 2011, 4:50:42 PM
Author:
lindanl
Message:

minor changes

Location:
docs/HPCA2012/final_ieee
Files:
15 edited

  • docs/HPCA2012/final_ieee/00-abstract.tex

    r1752 r1774  
    4949
    5050Modern applications employ text files widely for providing data
    51 storage in readable format for applications ranging from database
     51storage in a readable format for applications ranging from database
    5252systems to mobile phones. Traditional text processing tools are built
    53 around a byte-at-a-time sequential processing model, and introduce
     53around a byte-at-a-time sequential processing model that introduces
    5454significant branch and cache miss penalties.  Recent work has
    5555explored an alternative, transposed representation of text, Parabix (Parallel Bit
    5656Streams), to accelerate scanning and parsing using SIMD facilities.
    57 This paper further advocates and develops Parabix as a general framework
     57This paper advocates and develops Parabix as a general framework
    5858and toolkit, describing the software toolchain and run-time support
    5959that allows applications to exploit modern SIMD instructions for high
     
    6969Parabix exploits intra-core SIMD hardware and demonstrates
    70702$\times$--7$\times$ speedup and 4$\times$ improvement in energy
    71 efficiency compared to two widely used conventional software parsers,
     71efficiency when compared with two widely used conventional software parsers,
    7272Expat and Apache-Xerces. SIMD implementations across three
    7373generations of x86 processors are studied including the new \SB{}.
    7474The 256-bit AVX technology in Intel \SB{} is compared with the
    75 well-established 128-bit
     75well established 128-bit
    7676SSE technology to analyze the benefits and challenges of 3-operand
    7777instruction formats and wider SIMD hardware.  Finally,
     
    7979that thread-level parallelism enables the application to exploit SIMD units scattered
    8080across the different cores, achieving improved performance (2$\times$ on 4
    81 cores) at same energy levels as the single-thread version for the XML
    82 application.
     81cores) while maintaining single-threaded energy levels.
  • docs/HPCA2012/final_ieee/01-intro.tex

    r1768 r1774  
    11\section{Introduction}
    22Modern applications ranging from web search to analytics are mainly
    3 data centric operating over large swaths of data. Information expansion
     3data centric operating over large swaths of information. Information expansion
    44and diversification of data has resulted in multiple textual storage
    55formats.  Of these, XML is one of the most widely used standards, providing
     
    7575
    7676
    77 In this paper, we generalize parallel bitstreams and develop the
     77In this paper, we generalize parallel bit streams and develop the
    7878Parabix programming framework to help programmers build text
    79 processing appliations. The programmers specify the operations on
    80 unbounded character lists using bitstreams in a python environment,
    81 while our code generation and runtime translate them into low-level
     79processing applications. Programmers specify operations on
     80unbounded character lists using bit streams in a Python environment.
     81Our code generation and runtime system translates them into low-level
    8282C++ routines.  The Parabix routines exploit the SIMD extensions on
    8383commodity processors (SSE/AVX on x86, Neon on ARM) to process hundreds
    84 of character positions in an input stream simultaneously dramatically
     84of character positions in an input stream simultaneously, dramatically
    8585improving the execution efficiency. We describe the overall Parabix
    86 tool chain, a novel execution framework and a software build environment
     86tool chain, a novel execution framework and software build environment
    8787that enables text processing applications to effectively exploit
    8888commodity multicores.
     
    9292architectures.
    9393Figure~\ref{perf-energy} showcases the overall efficiency of our
    94 framework, dramatically improving both performance and
     94framework and dramatic improvements in both performance and
    9595energy efficiency. The Parabix-XML parser exploits
    96 the bitstream technology to dramatically reduce branches in the
    97 parsing routines resulting in a more efficient pipeline. It also
     96the bit stream technology to dramatically reduce branches in
     97parsing routines and realize a more efficient pipeline execution. It also
    9898substantially improves register utilization which minimizes energy
    9999wasted on cache misses and data transfers.\footnote{The actual energy consumption of the XML
     
    104104
    1051051) We outline the Parabix architecture, code-generation tool chain and
    106 runtime environment and describe how it may be used to produce
     106runtime environment, and describe how it may be used to produce
    107107efficient XML parser implementations on a variety of commodity
    108108processors.  While studied in the context of XML parsing, the Parabix
     
    134134architecture, tool chain and runtime environment.
    135135Section~\ref{section:parser} describes the design of an XML parser
    136 based on the Parabix framework.  Section~\ref{section:baseline}
     136based on the Parabix framework. Section~\ref{section:methodology} details
     137our evaluation framework. Section~\ref{section:baseline}
    137138presents a detailed performance analysis of Parabix on a
    138139\CITHREE\ system using hardware performance counters.
  • docs/HPCA2012/final_ieee/02-background.tex

    r1737 r1774  
    8888parsing is that processing an XML document requires at least one
    8989conditional branch per byte of source text.  For example, Xerces-C,
    90 which forms the foundation for widely deployed the Apache XML project
     90which forms the foundation for the widely deployed Apache XML project
    9191\cite{xerces}, uses a series of nested switch statements and
    9292state-dependent flag tests to control the parsing logic of the
    9393program. Xerces's complex data dependent control flow requires between
    94 6 -- 13 branches per byte of XML input, depending on the markup in
     946--13 branches per byte of XML input, depending on the markup in
    9595the file (details in Section~\ref{section:XML-branches}).  Cache
    9696utilization is also significantly reduced due to the manner in which
     
    9999XML data. In general, while microarchitectural improvements may help the
    100100parser tide over some of these challenges (e.g., cache misses), the
    101 fundamental data and control flow in the parsers are ill suited for
     101fundamental data and control flow in byte-at-a-time parsers are ill suited for
    102102commodity processors and experience significant overhead.
    103103
  • docs/HPCA2012/final_ieee/03-research.tex

    r1751 r1774  
    1818traditional text processing models is in how Parabix represents the
    1919source data.  Given a traditional byte-oriented text stream, Parabix
    20 first transposes the text data to a transform domain consisting of 8
    21 parallel bit streams, known as {\em basis bit streams}.  In essence,
    22 each basis bit stream $b_{k}$ represents the stream of $k$-th bit of
     20first transposes the text data into a transform representation consisting of 8
     21parallel bit streams, known as {\em basis bit streams} wherein
     22basis bit stream $b_{k}$ represents the stream of $k$-th bit of
    2323each byte in the source text.  That is, the $k$-th bit of $i$-th byte
    24 in the source text is in the $i$-th (bit) position of the $k$-th basis
     24in the source text is in one-to-one correspondence with the $i$-th bit of the $k$-th basis
    2525bit stream, $b_{k}$.  For example, in Figure~\ref{fig:BitStreamsExample}, we show how the ASCII string ``{\ttfamily
    2626  b7\verb`<`A}'' is represented as 8 basis bit streams, $\tt b_{0
    27   \ldots 7}$. The bits used to construct $\tt b_7$ have been
     27  \ldots 7}$. The bits used to construct $\tt b_7$ are
    2828highlighted in this example.
    2929
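The transposition described in this hunk can be illustrated with a minimal Python sketch. It is not the Parabix implementation: each bit stream is modelled as an arbitrary-precision integer whose bit $i$ (counting from the least significant end) corresponds to byte $i$ of the input, and $b_0$ is assumed to hold the most significant bit of each byte, which is the numbering implied by the `<` formula later in this file. The function names are hypothetical.

```python
def transpose(data):
    """Split a byte string into 8 basis bit streams.

    Each stream is a Python int; bit i (least significant first)
    corresponds to byte i of the input. basis[0] holds the most
    significant bit of every byte (an assumption of this sketch).
    """
    basis = [0] * 8
    for i, byte in enumerate(data):
        for k in range(8):
            bit = (byte >> (7 - k)) & 1
            basis[k] |= bit << i
    return basis


def inverse_transpose(basis, length):
    """Rebuild the original bytes from the 8 basis bit streams."""
    out = bytearray(length)
    for i in range(length):
        for k in range(8):
            bit = (basis[k] >> i) & 1
            out[i] |= bit << (7 - k)
    return bytes(out)


def show(stream, length):
    """Render a stream as a string of 0s and 1s, position 0 first."""
    return "".join(str((stream >> i) & 1) for i in range(length))


text = b"b7<A"
basis = transpose(text)
for k, stream in enumerate(basis):
    print("b%d = %s" % (k, show(stream, len(text))))
```

The loop is bit-at-a-time for readability only; a SIMD implementation would perform the same transposition on whole blocks of bytes.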
     
    5353use the 128-bit SIMD registers commonly found on commodity processors
    5454(e.g. SSE on Intel) to process 128 byte positions at a time using
    55 bitwise logic, shifting and other operations.
     55bitwise logical, shift and arithmetic operations.
    5656
    5757Just as forward and inverse Fourier transforms are used to transform
    5858between the time and frequency domains in signal processing, bit
    5959stream transposition and inverse transposition provides ``byte space''
    60 and ``bit space'' views of text.  The goal of the Parabix framework is
     60and ``bit space'' domains for text.  The goal of the Parabix framework is
    6161to support efficient text processing using these two equivalent
    62 representations in the same way that efficient signal processing
     62representations in the analogous way that efficient signal processing
    6363benefits from the use of the frequency domain in some cases and the
    6464time domain in others.
     
    101101metadata parsing.
    102102% For example, in a CSV file, any `,' or `\textbackslash n' can indicate the start of a new column or row respectively.
    103 For example, in XML, any opening angle bracket character, `\verb`<`', may indicate that we are starting a new markup tag.
     103For example, in XML, any opening angle bracket character, `\verb`<`', may indicate the start of a markup tag.
    104104Traditional byte-at-a-time parsers find these characters by comparing the value of each byte with a set
    105105of known significant characters and branching appropriately when one is found, typically using an if or switch statement.
     
    109109% However, a `<' is legal within an XML comment so not every `<' necessarily means that we are opening a new tag.
    110110
    111 Character-class bit streams allow us to perform up to 128
    112 ``comparisons'' in parallel with a single operation by using a series
    113 of boolean-logic operations \footnote{$\land$, $\lor$ and $\lnot$
    114   denote the boolean AND, OR and NOT operations.}  to merge multiple
    115 basis bit streams into a single character-class stream that marks the
    116 positions of key characters with a $1$. For example, a character is
     111Character-class bit streams enable up to 128
     112``comparisons'' in parallel through a
     113series of boolean-logic operations \footnote{$\land$, $\lor$ and $\lnot$
     114  denote the boolean AND, OR and NOT operations.}  that merge multiple basis
     115bit streams into a single character-class stream that marks the
     116positions of key characters. For example, a character is
    117117an `\verb`<`' if and only if $\lnot(b_ 0 \lor b_1) \land (b_2 \land
    118 b_3 \land b_4 \land b_5) \land \lnot (b_6 \lor b_7) = 1$.  Classes of
    119 characters can be found with similar formulas.  For example, a
     118b_3 \land b_4 \land b_5) \land \lnot (b_6 \lor b_7) = 1$.  Additional character
     119classes can be determined with similar formulas.  For example, a
    120120character is a number {\tt [0-9]} if and only if $\lnot(b_0 \lor b_1)
    121121\land (b_2 \land b_3) \land \lnot(b_4 \land (b_5 \lor b_6))$.  An
     
    135135To perform lexical analysis on the input data, Parabix computes lexical and error bit streams from the character-class bit streams using
    136136a mixture of both boolean logic and arithmetic operations. Lexical bit streams typically mark multiple current parsing positions.
    137 Unlike the single-cursor approach of traditional text parsers, these allow Parabix to process multiple cursors in parallel.
     137Unlike the single-cursor approach of traditional text parsers, the marking of multiple lexical items allows Parabix to process multiple items in parallel.
    138138Error bit streams are often the byproduct or derivative of computing lexical bit streams and can be used to identify any well-formedness
    139 issues found during the parsing process. The presence of a $\tt 1$ in an error stream indicates that the lexical stream cannot be
    140 trusted to be completely accurate and it may be necessary to perform some sequential parsing on that section to determine the cause and severity
    141 of the error. %How errors are handled depends on the logical implications of the error and go beyond the scope of this paper.
    142 
    143 To form lexical bit streams, we have to introduce a few new operations: {\tt Advance} and {\tt ScanThru}.
     139issues found during the parsing process. A $\tt 1$ bit in an error stream indicates the presence of a potential error that may require further
     140processing to determine cause and severity.
     141
     142To form lexical bit streams we introduce two new operations: {\tt Advance} and {\tt ScanThru}.
    144143The {\tt Advance} operator accepts one input parameter, $c$, which is typically viewed as a bit stream containing multiple cursor bits,
    145144and advances each cursor one position forward.  On little-endian architectures, shifting forward means shifting to the right.
    146 {\tt ScanThru} accepts two input parameters, $c$ and $m$; any bit that is in both $c$ and $m$ is moved to first subsequent
    147 $\tt 0$-bit in $m$ by calculating $(c + m) \land \lnot m$. 
     145{\tt ScanThru} accepts two input parameters, $c$ and $m$, where $c$ denotes an initial
     146set of cursor positions, and $m$ denotes a set of ``marked'' lexical item positions.
     147The ScanThru operation determines the cursor positions immediately
     148following any run of marker positions by calculating $(c + m) \land \lnot m$. 
    148149For example, in Figure \ref{fig:ParabixParsingExample} suppose we have the regular expression \verb`<[a-zA-Z]+>` and wish to find all
    149150instances of it in the source text.
     
    152153token. By computing $E_{0}$, the parser notes that ``\verb`<>`'' does not match the expected pattern. To find the end positions of each token,
    153154the parser calculates $L_{1}$ by moving the cursors in $L_0$ through the letter bits in $C_0$. $L_1$ is then validated to ensure that each
    154 token ends with a `\verb`>`' and discovers that ``\verb`<error]`'' too fails to match the expected pattern.
     155token ends with a `\verb`>`' and discovers that ``\verb`<error]`'' also fails to match the expected pattern.
    155156With additional post bit-stream processing, the erroneous cursors in $L_{0}$ and $L_{1}$ can be removed; the details
    156157of which go beyond the scope of this paper.
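The matching of \verb`<[a-zA-Z]+>` with character-class streams, {\tt Advance} and {\tt ScanThru} can be sketched in Python as follows. This is an illustration under stated assumptions, not the code of the paper: bit streams are Python integers with bit $i$ standing for byte $i$, so advancing a cursor is a left shift in integer terms; the `<` formula is the one given in the text, while the formulas for `>` and for letters are derived here for the sketch; the input string is hypothetical and is not the content of the figure.

```python
def transpose(data):
    """Basis bit streams; basis[0] is the most significant bit of each byte."""
    basis = [0] * 8
    for i, byte in enumerate(data):
        for k in range(8):
            basis[k] |= ((byte >> (7 - k)) & 1) << i
    return basis


def char_classes(data):
    """Return (lt, gt, letter) character-class streams for the input."""
    b = transpose(data)
    mask = (1 << len(data)) - 1
    # '<' is 0x3C: formula taken from the text.
    lt = ~(b[0] | b[1]) & (b[2] & b[3] & b[4] & b[5]) & ~(b[6] | b[7]) & mask
    # '>' is 0x3E: derived for this sketch.
    gt = ~(b[0] | b[1]) & (b[2] & b[3] & b[4] & b[5] & b[6]) & ~b[7] & mask
    # [a-zA-Z]: top bits 01x, low five bits in 1..26 (derived for this sketch).
    low5_nonzero = b[3] | b[4] | b[5] | b[6] | b[7]
    low5_over_26 = b[3] & b[4] & (b[5] | (b[6] & b[7]))
    letter = ~b[0] & b[1] & low5_nonzero & ~low5_over_26 & mask
    return lt, gt, letter


def advance(c):
    """Move every cursor one position forward."""
    return c << 1


def scan_thru(c, m):
    """Move each cursor in c past the run of marked positions in m."""
    return (c + m) & ~m


def positions(stream):
    """List the positions of the 1 bits in a stream."""
    return [i for i in range(stream.bit_length()) if (stream >> i) & 1]


def match_tags(data):
    lt, gt, letter = char_classes(data)
    l0 = advance(lt)                      # cursors just after each '<'
    e0 = l0 & ~letter                     # '<' not followed by a letter
    l1 = scan_thru(l0 & letter, letter)   # cursors after each run of letters
    e1 = l1 & ~gt                         # letters not followed by '>'
    return positions(l1 & gt), positions(e0 | e1)


sample = b"<a><>  <error] <tag>"   # hypothetical input
ends, errors = match_tags(sample)
print("token end positions:", ends)
print("error positions:", errors)
```

Every cursor is processed by the same few integer operations, regardless of how many tokens the input contains.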
     
    188189% details, refer to the technical report \cite{Cameron2010}.
    189190
    190 Using this parallel bit stream approach, conditional branch statements
     191Using this parallel bit stream approach, the vast majority of conditional branches
    191192used to identify key positions and/or syntax errors at each
    192193parsing position are eliminated, which, as Section
     
    211212application.  Input is specified using a character class syntax
    212213adapted from the standard regular expression notations.  Output is a
    213 minimized set of three-address bitwise operations to compute each of
     214minimized set of three-address bitwise operations that compute each of
    214215the character classes from the basis bit streams.
    215 
    216 
    217 For example, Figure \ref{fig:CCC} shows the input and output produced
     216Figure \ref{fig:CCC} shows the input and output produced
    218217by the character class compiler for the example of \verb`[0-9]`
    219218discussed in the previous section.  The output operations may be
     
    305304carry variable declarations that allow the results of
    306305{\tt Advance} and {\tt ScanThru} operations to be carried over from
    307 block to block.  A separate carry variable is required for every
     306block to block.  A separate carry variable is required for every
    308307{\tt Advance} or {\tt ScanThru} operation.  A function containing
    309308such operations is translated into a public C++ class (struct),
     
    316315specific architecture and Carry Queue representation.
    317316The unbounded bit stream {\tt Advance} and {\tt ScanThru}
    318 operations are translated into block-by-block equivalents
     317operations are translated into block-wise equivalents
    319318with explicit carry-in and carry-out processing. 
    320319At the end of each block, the {\tt CarryQ\_Adjust}
     
    325324(if and while constructs) which involves additional
    326325carry-test insertion in control branches.
    327 Explaining the full details of the translation
     326A complete explanation of the Pablo translation
    328327is beyond the scope of this paper.
    329328
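The block-wise form of {\tt Advance} and {\tt ScanThru} with explicit carry-in and carry-out can be sketched as follows. This is a simplified Python model and not the code generated by the Pablo translation: blocks are simulated with integers masked to a hypothetical width of 128 bits, each operation keeps its own carry variable, and the Carry Queue representation and {\tt CarryQ\_Adjust} step are not modelled.

```python
BLOCK_BITS = 128                  # hypothetical block width (one SIMD register)
MASK = (1 << BLOCK_BITS) - 1


def advance_block(block, carry_in):
    """Advance within one block; the bit shifted out becomes the carry-out."""
    shifted = (block << 1) | carry_in
    return shifted & MASK, shifted >> BLOCK_BITS


def scanthru_block(c, m, carry_in):
    """ScanThru within one block; the addition overflow becomes the carry-out."""
    total = c + m + carry_in
    return (total & MASK) & ~m, total >> BLOCK_BITS


def split_blocks(stream, n_blocks):
    """Cut an unbounded stream (a Python int) into fixed-width blocks."""
    return [(stream >> (i * BLOCK_BITS)) & MASK for i in range(n_blocks)]


def join_blocks(blocks):
    """Reassemble blocks into a single unbounded stream."""
    out = 0
    for i, blk in enumerate(blocks):
        out |= blk << (i * BLOCK_BITS)
    return out


def advance_stream(blocks):
    carry = 0                     # one carry variable for this Advance
    out = []
    for blk in blocks:
        res, carry = advance_block(blk, carry)
        out.append(res)
    return out


def scanthru_stream(c_blocks, m_blocks):
    carry = 0                     # one carry variable for this ScanThru
    out = []
    for c, m in zip(c_blocks, m_blocks):
        res, carry = scanthru_block(c, m, carry)
        out.append(res)
    return out
```

Processing the blocks in order with the carry threaded through gives the same result as applying the operation to the whole stream at once, which is the property the block-wise translation relies on.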
     
    332331The Parabix architecture also includes runtime libraries that support
    333332a machine-independent view of basic SIMD operations, as well as a set
    334 of core function libraries.  For machine-independence, we program all
    335 operations using an abstract SIMD machine.  The abstract machine
     333of core function libraries. For portability, we program all SIMD operations against
     334an abstract SIMD machine representation, parameterized on SIMD
     335field and register width. The abstract machine
    336336supports all power-of-2 field widths up to the full SIMD register
    337337width on a target machine.  Let $w = 2^k$ be the field width in
     
    350350currently take advantage of the 128-bit Altivec operations on the
    351351Power PC, 64-bit MMX and 128-bit SSE operations on previous generation
    352 Intel platforms, the latest 256-bit AVX extensions on the Sandybridge
     352Intel platforms, the latest 256-bit AVX extensions on the \SB{}
    353353processor, and finally the 128-bit \NEON{} operations on ARM.
    354354
  • docs/HPCA2012/final_ieee/03b-research.tex

    r1733 r1774  
    1010\end{figure*}
    1111
    12 This section describes the implementation of the Parabix XML parser.
    13 Figure \ref{parabix_arch} shows its overall structure set up for
    14 well-formedness checking. 
     12This section describes the implementation of the Parabix XML parser
     13for well-formedness checking. Figure \ref{parabix_arch} shows its overall structure.
    1514The input file is processed using 11 functions organized into 7 modules. 
    1615In the first module, {\tt Read\_Data}, the input file is loaded into the
  • docs/HPCA2012/final_ieee/04-methodology.tex

    r1743 r1774  
    66against two widely available open-source parsers: Xerces-C \cite{xerces} and Expat \cite{expat}.
    77Each of the parsers is evaluated on the task of implementing the
    8 parsing and well-formedness validation requirements of the full
     8parsing and well-formedness checking requirements of the full
    99XML 1.0 specification\cite{TR:XML}.
    1010Xerces-C version 3.1.1 (SAX) is a validating XML
     
    7272
    7373\paragraph{Platform Hardware:}
    74 SSE extensions have been available on commodity Intel processors for
     74SSE SIMD extensions have been available on commodity Intel processors for
    7575over a decade since the Pentium III. They have steadily evolved with
    7676improvements in instruction latency, cache interface, register
    7777resources, and the addition of domain specific instructions. Here we
    7878investigate SIMD extensions across three different generations of
    79 intel processors (hardware details in Table \ref{hwinfo}). We compare
    80 the energy and performance profile of the Parabix under the platforms.
     79Intel processors (hardware details given in Table \ref{hwinfo}). We compare
     80the energy and performance profile of the Parabix parser on each of the platforms.
    8181We also analyze the implementation specifics of SIMD extensions under
    82 various microarchitectures and the newer AVX extensions supported by
    83 Sandybridge.
     82various microarchitectures as well as the newer AVX extensions supported by \SB{}.
    8483
    8584
    86 We investigated the execution profiles of each XML parser
     85We investigate the execution profiles of each XML parser
    8786using the performance counters found in the processor.
    88 We chose several key hardware events that provide insight into the profile of each
     87We choose several key hardware events that provide insight into the profile of each
    8988application and indicate if the processor is doing useful work
    9089~\cite{bellosa2001, bertran2010}. 
    91 The set of events included in our study are: Branch instructions, Branch mispredictions,
    92 Integer instructions, SIMD instructions, and Cache misses. In
     90The set of events included in our study are: branch instructions, branch mispredictions,
     91integer instructions, SIMD instructions, and cache misses. In
    9392addition, we characterize the SIMD operations and study the type and
    9493class of SIMD operations using the Intel Pin binary instrumentation
     
    108107monitored by an Agilent 34410a digital multimeter at the granularity
    109108of 100 samples per second. This measurement captures the instantaneous
    110 power to the processor package, including cores, caches, northbridge
     109power to the processor package, including the cores, caches, northbridge
    111110memory controller, and the quick-path interconnects. We obtain samples
    112111throughout the entire execution of the program and then calculate overall
  • docs/HPCA2012/final_ieee/05-corei3.tex

    r1768 r1774  
    22\label{section:baseline}
    33In this section we analyze the energy and performance characteristics
    4 of the Parabix-based XML parser against the software XML parsers,
     4of the Parabix XML parser against the software XML parsers,
    55Xerces and Expat. For our baseline evaluation, we compare all the XML
    66parsers on the \CITHREE{}.
     
    1212
    1313
    14 Table \ref{cache_misses} shows the cache misses per kilobyte of input
     14Table \ref{cache_misses} shows cache misses per kilobyte of input
    1515data. Analytically, the cache misses for the Expat and Xerces parsers
    1616represent a 0.5 cycle per XML byte cost.\footnote{The approximate miss penalty on the \CITHREE\ for L1, L2 and L3 caches is
     
    3535
    3636
    37 This overhead does not
    38 necessarily impact the overall performance of these parsers as they
    39 experience additional overheads related to branch mispredictions.
    40 Compared to Xerces and Expat, the data organization of Parabix-XML
     37This overhead has little impact on the overall performance of these parsers
     38as they experience additional overheads related to branch mispredictions.
     39When compared with Xerces and Expat, the data organization of Parabix-XML
    4140significantly reduces the overall cache miss rate; specifically, there
    4241were $7\times$ and $15\times$ fewer L1 and L2 cache misses compared to
     
    6261The performance of traditional parsers is limited by their branch
    6362behavior.  Xerces experiences up to 13 branches per input XML
    64 character on the high markup files; Expat experiences up to 8 branches
    65 per XML character.  In Parabix-XML, the use of SIMD operations
    66 eliminates many branches.  Most conditional branches can be replaced
     63character on the high markup files whereas Expat experiences up to 8.
     64In Parabix-XML, the use of SIMD operations eliminates a significant proportion
     65of the overall branches.  Most conditional branches can be replaced
    6766with bitwise operations, which can process up to 128 characters worth
    6867of branches with one operation or with a series of logical predicate
     
    7170
    7271
    73 The high miss prediction rate in conventional parsers is a significant
    74 overhead. The cost of a single branch misprediction is on the order of
    75 10s of CPU cycles spent to restart the processor pipeline on a
    76 misprediction. Parabix-XML is nearly branch free and exhibits minimal
    77 dependence on the source markup density. Specifically, it experiences
    78 between 19.5 and 30.7 branch mispredictions per kB of XML
    79 data. Conversely, the cost of branch mispredictions for the Expat
    80 parser can be over 7 cycles per XML byte (see Figure~\ref{corei3_BM})
     72The high branch misprediction rate of conventional parsers is a
     73significant overhead, with the cost of a single branch misprediction
     74on the order of 10s of CPU cycles spent to restart the processor
     75pipeline. Parabix-XML is nearly branch free and exhibits minimal
     76dependence on the source markup density. Specifically, our study
     77demonstrates that Parabix experiences between 19.5 and
     7830.7 branch mispredictions per kB of XML data. Conversely,
     79the cost of branch mispredictions for the Expat parser
     80can be over 7 cycles per XML byte (see Figure~\ref{corei3_BM})
    8181--- which exceeds the average latency of a byte processed by
    8282Parabix-XML.
    8383
    84 Unfortunately, it is difficult to reduce the branch misprediction rate
    85 of traditional XML parsers due to: (1) the variable length nature of
     84The branch misprediction rate of traditional XML parsers is difficult to reduce due to
     85a number of factors: (1) the variable length nature of
    8686the syntactic elements contained within XML documents; (2) input data
    8787dependent characteristic, and (3) the extensive set of syntax
     
    102102}
    103103\end{center}
    104 \caption{Branch Mispredictions on the \CITHREE{}. (/ 1kB input)}
     104\caption{Branch Mispredictions on the \CITHREE{} per kB input}
    105105\label{corei3_BM}
    106106
     
    168168on data-oriented input.  Traditional parsers can be dramatically
    169169slowed by dense markup but Parabix-XML is relatively unaffected.
    170 Unlike Parabix-XML and Expat, Xerces transcodes input to UTF-16 before
    171 processing it; this requires several cycles per byte. However,
     170Unlike Parabix-XML and Expat, Xerces transcodes input to UTF-16 prior to
     171processing; this requires several cycles per byte. However,
    172172transcoding using parallel bit streams is significantly faster and
    173173requires less than a single cycle per byte.
     
    175175
    176176 The energy trends shown in Figure \ref{corei3_energy} reveal an
    177  interesting trend. Parabix consumes substantially less energy than
     177 interesting result. Parabix consumes substantially less energy than
    178178 the other parsers. Parabix consumes 50 to 75 nJ per byte while Expat
    179179 and Xerces consume 80nJ to 320nJ and 140nJ to 370nJ per byte
    180  respectively. Parabix-XML experiences minimal increase in power
    181  ($\sim5\%$) compared to the conventional parsers. While the SIMD
     180 respectively. Parabix-XML experiences minimal increase in power consumption
     181 ($\sim5\%$) as compared to the conventional parsers. While the SIMD
    182182 functional units are significantly wider than the scalar
    183183 counterparts, register width and functional unit power account only
    184  for a small fraction of the overall power consumption in a processor
    185  pipeline. Parabix amortizes the fetch and data access overheads over
     184 for a small fraction of the overall power consumption in a pipeline
     185 processor. Parabix amortizes the fetch and data access overheads over
    186186 multiple data parallel operations. Although Parabix requires
    187187 slightly more power (per instruction), the processing time of Parabix
  • docs/HPCA2012/final_ieee/06-scalability.tex

    r1743 r1774  
    55In this section, we study the performance of the XML parsers across
    66three generations of Intel architectures.  Figure \ref{Parabix_all_platform}
    7 shows the average execution time of Parabix-XML (over all workloads).  We analyze the
     7shows the average execution time of Parabix-XML over all workloads.  We analyze the
    88execution time in terms of SIMD operations that operate on ``bit streams''
    9 (\textit{bit-space}) and scalar operations that perform ``post
    10 processing'' on the original source bytes.  In Parabix-XML, a significant
    11 fraction of the overall execution time is spent on SIMD operations. 
     9in \textit{bit-space} and scalar operations used to perform ``post
     10processing'' operations on the source input.
    1211
    1312Our results demonstrate that Parabix-XML's optimizations complement
     
    1514\CITHREE{} has a 40\% performance increase over \CO{};
    1615similarly, \SB{} has a 20\% improvement compared to
    17 \CITHREE{}. These gains appear to be independent of the markup
    18 density of the input file.
     16\CITHREE{}. These gains appear independent of the markup density.
    1917Postprocessing operations
    2018demonstrate data dependent variance. Performance on the \CITHREE{} increases by
     
    2321\CITHREE\ improves performance only by 29\% over \CO\ while \SB\
    2422improves performance by less than 6\% over \CITHREE{}. Note that the
    25 gains of \CITHREE\ over \CO\ includes an improvement both in the clock
    26 frequency and microarchitecture improvements while \SB{}'s gains can
    27 be mainly attributed to the architecture.
     23gains of \CITHREE\ over \CO\ include improvements in both clock
     24frequency and microarchitecture, while \SB{}'s gains are mainly attributed to the architecture.
    2825Figure \ref{Parabix_all_platform} also shows the average power consumption of
    29 Parabix-XML over each workload and as executed on each of the processor
    30 cores: \CO{}, \CITHREE\ and \SB{}.  Each
    31 generation of processor seem to bring with them 25--30\% improvement
     26Parabix-XML over each workload and as executed on each of the processors:
     27\CO{}, \CITHREE\ and \SB{}.  Each generation of processor appears to bring a 25--30\% improvement
    3228in power consumption over the previous generation. Parabix-XML on \SB\ consumes 72\%--75\% less energy than it did on \CO{}.
    3329
     
    6864\def\CORTEXA8{Cortex-A8}
    6965
    70 \subsection{Parabix on Mobile processors}
     66\subsection{Parabix on Mobile Processors}
    7167\label{section:scalability:\NEON{}}
    7268Our experience with Intel processors led us to
    7369question whether mobile processors with SIMD support, such as the ARM \CORTEXA8{},
    7470could benefit from Parabix technology. ARM \NEON{} provides a 128-bit SIMD
    75 instruction set similar in functionality to Intel SSE3 instruction
     71instruction set similar in functionality to the Intel SSE3 instruction
    7672set. In this section, we present our performance comparison of a
    7773\NEON{}-based port of Parabix versus the Expat parser. Xerces is excluded
     
    8177The platform we use is the Samsung Galaxy Android Tablet that houses a
    8278Samsung S5PC110 ARM \CORTEXA8{} 1GHz single-core, dual-issue,
    83 superscalar microprocessor. It includes a 32kB L1 data cache and a
     79superscalar microprocessor. This device includes a 32kB L1 data cache and a
    8480512kB L2 shared cache.  Migration of Parabix-XML to the Android platform
    8581only required developing a Parabix runtime library for ARM \NEON{}.
     
    8783directly. However, a small subset of key SIMD instructions (e.g., bit
    8884packing) did not exist on \NEON{}. In such cases, the
    89 logical equivalent of those instructions was emulated using the available
     85logical equivalents of those instructions were emulated using the available
    9086ISA. The resulting application was cross-compiled for
    9187Android using the Android NDK.
     
    108104Expat and Parabix for the various input workloads on the \CORTEXA8{};
    109105Figure~\ref{relative_performance_intel} plots the performance for
    110 \CITHREE{}. The results demonstrate that that the execution time of
     106\CITHREE{}. The results demonstrate that the execution time of
    111107each parser varies in a linear fashion with respect to the markup
    112108density of the file. On both \CORTEXA8{} and \CITHREE{} both
     
    120116implemented as a coprocessor on the \CORTEXA8{}, which imposes a higher
    121117overhead for applications that frequently inter-operate between scalar
    122 and SIMD registers. Future performance enhancement to ARM \NEON{} that
    123 implement the \NEON{} within the core microarchitecture could
    124 substantially improve the efficiency of Parabix-XML.
     118and SIMD registers. Future performance enhancements to the \NEON{} ISA on
     119ARM could substantially improve the efficiency of Parabix.
    125120
    126121
  • docs/HPCA2012/final_ieee/07-avx.tex

    r1751 r1774  
    1313
    1414\subsection{3-Operand Form}
    15 In addition to widening the 128-bit operations to 256-bit,
     15In addition to widening the 128-bit operations to 256-bit operations,
    1616 AVX technology uses a nondestructive 3-operand instruction
    1717format. Previous SSE implementations used a destructive 2-operand
     
    1919as both a source and destination register. As such, 2-operand instructions that require the
    2020value of both $a$ and $b$, must either copy an additional register
    21 value beforehand, or reconstitute or reload a register value
     21value beforehand, or reconstitute a register value
    2222afterwards to recover the value.  With the 3-operand format, output
    2323may now be directed to the third register independently of the source
     
    2525copy or reconstitute operand values, a considerable reduction
    2626in instructions required for unloading from and loading into
    27 registers.  AVX technology makes available the 3-operand form for both
    28 the new 256-bit AVX and as the 128-bit SSE operations.
     27registers is achieved.  AVX technology makes available the 3-operand form for both
     28the new 256-bit AVX and the 128-bit SSE operations.
    2929
    3030\subsection{256-bit Operations}
     
    4747leverage the 256-bit AVX instructions wherever possible and to simulate
    4848the remaining operations using pairs of 128-bit operations. Figure
    49 \ref{insmix} shows the reduction in instruction counts achieved in
    50 these two versions. For each workload, the base instruction count of
    51 the Parabix binary compiled in 2-operand SSE-only mode is indicated by ``sse;''
     49\ref{insmix} shows the reduction in instruction count achieved in
     50each version. For each workload, the base instruction count of
     51the Parabix binary compiled in 2-operand SSE-only mode is indicated by ``sse'';
    5252the version that only takes advantage of the AVX 3-operand mode is
    53 labeled ``128-bit avx,'' and the version uses the 256-bit
    54 operations wherever possible is labeled ``256-bit avx.''  The
     53labeled ``128-bit avx'', and the version that uses the 256-bit
     54operations wherever possible is labeled ``256-bit avx''.  The
    5555instruction counts are divided into three classes: ``non-SIMD''
    5656operations are the general purpose instructions.  The ``bitwise SIMD''
     
    6666remains relatively constant with each workload.  As expected,
    6767the number of bitwise SIMD operations remains the same
    68 for both SSE and 128-bit while dropping dramatically when operating
     68for both SSE and 128-bit AVX while dropping dramatically when operating
    6969on 256 bits at a time. The reduction was measured at 32\%--39\% depending
    7070on markup density of the workload. The ``other SIMD'' class
     
    8585benefits of 3-operand form seem to fully translate to performance
    8686benefits.  Based on the reduction of overall Bitwise-SIMD instructions
    87 we expected a 11\% improvement in performance.  Instead, perhaps
    88 bizarrely, the performance of Parabix in the 256-bit AVX
     87we expected an 11\% improvement in performance.
     88Surprisingly, the performance of Parabix in the 256-bit AVX
    8989implementation does not improve significantly and actually degrades
    9090for files with higher markup density ($\sim11\%$). dew.xml, on
    91 which bitwise-SIMD instructions reduced by 39\%, saw a performance
     91which bitwise-SIMD instructions were reduced by 39\%, saw a performance
    9292improvement of 8\%.  We believe that this is primarily due to the
    9393intricacies of the first generation AVX implementation in \SB{}, with
  • docs/HPCA2012/final_ieee/09-pipeline.tex

    r1743 r1774  
    33Even if an application is infinitely parallelizable and thread
    44synchronization costs are non-existent, all applications are constrained by
    5 the power and energy overheads incurred when utilizing multiple cores:
     5the power and energy overheads incurred when utilizing multiple cores;
    66as more cores are put to work, a proportional increase in power occurs.
    77Unfortunately, due to the runtime overheads associated with
     
    1717
    1818The typical approach to handling data parallelism with multiple threads
    19 involves partitioning data uniformly across the threads. However XML
     19involves partitioning data uniformly across the threads. However, XML
    2020parsing is inherently sequential, which makes it difficult to
    2121partition the data. Several attempts have been made to address this
    22 problem using a preparsing phase to help determine the tree structure
     22problem. For example, a preparsing phase can be used to help determine the tree structure
    2323and to partition the XML document accordingly~\cite{dataparallel}.
    2424Another approach involved speculatively partitioning the data~\cite{Shah:2009} but
     
    5353partitioned Parabix-XML into four stages and assigned a core to
    5454each stage. One of the key challenges was to determine which passes
    55 should be grouped together. By analyzing the latency and data dependencies of each of
    56 the passes in the single-threaded version of Parabix-XML
    57 (Column 3 in Table~\ref{pass_structure}), and assigned the passes
    58 to stages such that provided the maximal throughput.
     55should be grouped together. We analyzed the latency and data dependencies of each of the passes
     56in the single-threaded version of Parabix (Column 3 in Table~\ref{pass_structure}),
     57and assigned the passes to stages to maximize throughput.
    5958
    6059
     
    7170controlling the overall size of the ring buffer. Whenever a faster stage
    7271runs ahead, it will effectively cause the ring buffer to fill up and
    73 force that stage to stall. Experiments show that 6 entries of the
     72force that stage to stall. Experiments show that six entries of the
    7473circular buffer give the best performance.
    7574
     
    7877single-threaded version.  The 4-threaded version is $\simeq2\times$
    7978faster than the single-threaded version and achieves
    80 $\simeq2.7$ cycles per input byte by exploiting SIMD units of all
     79$\simeq2.7$ cycles per input byte by exploiting the SIMD units of all
    8180\SB{}'s cores.  This performance approaches the 1 cycle per byte
    8281performance of custom hardware solutions~\cite{DaiNiZhu2010}. Parabix
  • docs/HPCA2012/final_ieee/10-related.tex

    r1768 r1774  
    99% construction costs of the more flexible DOM (Document Object Model)
    1010% parsers \cite{Perkins05}.  Nicola and John specifically identified
    11 the traditional method of XML parsing as a threat to database
    12 performance and outlined a number of potential directions for
    13 improving performance \cite{NicolaJohn03}.  The commercial importance
     11Nicola and John identified XML parsing as a threat to database
     12performance and outlined a number of potential directions for
     13improving performance~\cite{NicolaJohn03}.  The commercial importance
    1414of XML parsing has spurred the development of numerous multi-threaded
    1515and hardware-based approaches: Multithreaded XML techniques include
     
    2020\cite{DaiNiZhu2010}. Others have explored the design of custom
    2121hardware for bit parallel operations for text search in network
    22 processors~\cite{tan-sherwood-isca-2005}. Intel's SSE4 instructions targeted
     22processors~\cite{tan-sherwood-isca-2005}. Intel's SSE4.2 instructions targeted
    2323XML parsers, but these have not seen widespread use because of portability
    2424concerns and the programming challenges that accompany low level
     
    3131SSE2 instructions and proposed an inductive doubling instruction set
    3232~\cite{CameronLin2009}. In this paper, we have developed a generalized
    33 parabix architecture and have described the software tool chain that
     33Parabix architecture and have described the software tool chain that
    3434programmers can use to build scalable text processing applications on
    3535commodity multicores. We have explored in detail the tradeoffs
  • docs/HPCA2012/final_ieee/11-conclusions.tex

    r1737 r1774  
    1010% Future research
    1111
    12 In this paper we presented Parabix a software runtime framework for
     12In this paper we presented Parabix, a software runtime framework for
    1313exploiting SIMD data units found on commodity processors for text
    14 processing.  The Parabix framework allows to focus on exposing the
     14processing.  The Parabix framework allows programmers to focus on exposing the
    1515parallelism in their application assuming an infinite resource
    1616abstract SIMD machine without worrying about or having to change code
    1717to handle processor specifics (e.g., 128-bit SIMD SSE vs 256-bit SIMD
    1818on AVX). We applied Parabix technology to a widely deployed
    19 application; XML parsing and demonstrate the efficiency gains that can
     19application, XML parsing, and demonstrated the efficiency gains that can
    2020be obtained on commodity processors. Compared to the conventional XML
    2121parsers, Expat and Xerces, we achieve 2$\times$--7$\times$
     
    2424reduction in branches, 7$\times$--15$\times$ reduction in branch mispredictions,
    2525% ?\times$ reduction in LLC misses, and increase in data parallelism
    26 processing up to 128 characters with a single operation. We used the
    27 Parabix framework and XML parsers to study the features of the new 256
    28 bit AVX extension in Intel processors. We find that while the move to
     26and process up to 128 characters with a single operation. We used the
     27Parabix framework and XML parsers to study the features of the new 256-bit
     28AVX extension in Intel processors. We find that while the move to
    29293-operand instructions delivers significant benefit, the wider
    3030operations in some cases have higher overheads compared to the
  • docs/HPCA2012/final_ieee/final.aux

    r1768 r1774  
    4141\newlabel{parsers}{{5}{5}}
    4242\@writefile{toc}{\contentsline {paragraph}{XML Parsers:}{5}}
    43 \newlabel{workloads}{{5}{5}}
    4443\citation{bellosa2001,bertran2010}
    4544\citation{clamp}
    4645\@writefile{lof}{\contentsline {figure}{\numberline {7}{\ignorespaces Parabix XML Parser Structure\relax }}{6}}
    4746\newlabel{parabix_arch}{{7}{6}}
     47\newlabel{workloads}{{5}{6}}
    4848\@writefile{toc}{\contentsline {paragraph}{XML Workloads:}{6}}
    4949\@writefile{lot}{\contentsline {table}{\numberline {1}{\ignorespaces XML Document Characteristics\relax }}{6}}
     
    6060\@writefile{toc}{\contentsline {subsection}{\numberline {6.2}Branch Mispredictions}{7}}
    6161\newlabel{section:XML-branches}{{6.2}{7}}
     62\@writefile{lof}{\contentsline {figure}{\numberline {8}{\ignorespaces Branch Mispredictions on the Core-i3{} per kB input\relax }}{7}}
     63\newlabel{corei3_BM}{{8}{7}}
    6264\@writefile{toc}{\contentsline {subsection}{\numberline {6.3}SIMD Instructions vs. Total Instructions}{7}}
    63 \@writefile{lof}{\contentsline {figure}{\numberline {8}{\ignorespaces Branch Mispredictions on the Core-i3{}. (/ 1kB input)\relax }}{7}}
    64 \newlabel{corei3_BM}{{8}{7}}
    6565\@writefile{lot}{\contentsline {table}{\numberline {4}{\ignorespaces SIMD Instruction Percentage\relax }}{7}}
    6666\newlabel{corei3_INS_p2}{{4}{7}}
    6767\@writefile{toc}{\contentsline {subsection}{\numberline {6.4}Performance and Energy Characteristics}{7}}
    68 \newlabel{corei3_TOT}{{9(a)}{8}}
    69 \newlabel{sub@corei3_TOT}{{(a)}{8}}
    70 \newlabel{corei3_energy}{{9(b)}{8}}
    71 \newlabel{sub@corei3_energy}{{(b)}{8}}
    72 \@writefile{lof}{\contentsline {figure}{\numberline {9}{\ignorespaces Performance and Energy profile of Parabix on Core i3\relax }}{8}}
    73 \@writefile{lof}{\contentsline {subfigure}{\numberline{(a)}{\ignorespaces {Performance (CPU Cycles per kB)}}}{8}}
    74 \@writefile{lof}{\contentsline {subfigure}{\numberline{(b)}{\ignorespaces {Energy Consumption ($\mu $J per kB)}}}{8}}
    7568\@writefile{toc}{\contentsline {section}{\numberline {7}Parabix on different platforms}{8}}
    7669\newlabel{section:scalability}{{7}{8}}
    7770\@writefile{toc}{\contentsline {subsection}{\numberline {7.1}Performance}{8}}
    7871\newlabel{section:scalability:intel}{{7.1}{8}}
    79 \@writefile{toc}{\contentsline {subsection}{\numberline {7.2}Parabix on Mobile processors}{8}}
    80 \newlabel{section:scalability:Neon{}}{{7.2}{8}}
    8172\@writefile{lof}{\contentsline {figure}{\numberline {10}{\ignorespaces Parabix on various hardware platforms\relax }}{8}}
    8273\newlabel{Parabix_all_platform}{{10}{8}}
     74\@writefile{toc}{\contentsline {subsection}{\numberline {7.2}Parabix on Mobile Processors}{8}}
     75\newlabel{section:scalability:Neon{}}{{7.2}{8}}
     76\@writefile{toc}{\contentsline {section}{\numberline {8}Parabix on AVX}{8}}
     77\newlabel{section:avx}{{8}{8}}
     78\newlabel{corei3_TOT}{{9(a)}{9}}
     79\newlabel{sub@corei3_TOT}{{(a)}{9}}
     80\newlabel{corei3_energy}{{9(b)}{9}}
     81\newlabel{sub@corei3_energy}{{(b)}{9}}
     82\@writefile{lof}{\contentsline {figure}{\numberline {9}{\ignorespaces Performance and Energy profile of Parabix on Core i3\relax }}{9}}
     83\@writefile{lof}{\contentsline {subfigure}{\numberline{(a)}{\ignorespaces {Performance (CPU Cycles per kB)}}}{9}}
     84\@writefile{lof}{\contentsline {subfigure}{\numberline{(b)}{\ignorespaces {Energy Consumption ($\mu $J per kB)}}}{9}}
    8385\newlabel{arm_processing_time}{{11(a)}{9}}
    8486\newlabel{sub@arm_processing_time}{{(a)}{9}}
     
    9193\@writefile{lof}{\contentsline {subfigure}{\numberline{(b)}{\ignorespaces {ARM Neon}}}{9}}
    9294\@writefile{lof}{\contentsline {subfigure}{\numberline{(c)}{\ignorespaces {Core i3}}}{9}}
    93 \@writefile{toc}{\contentsline {section}{\numberline {8}Parabix on AVX}{9}}
    94 \newlabel{section:avx}{{8}{9}}
    95 \@writefile{toc}{\contentsline {subsection}{\numberline {8.1}3-Operand Form}{9}}
    96 \@writefile{toc}{\contentsline {subsection}{\numberline {8.2}256-bit Operations}{9}}
    97 \@writefile{toc}{\contentsline {subsection}{\numberline {8.3}Performance Results}{9}}
     95\@writefile{lof}{\contentsline {figure}{\numberline {12}{\ignorespaces Parabix Instruction Counts (y-axis: Instructions per kB)\relax }}{9}}
     96\newlabel{insmix}{{12}{9}}
     97\@writefile{toc}{\contentsline {subsection}{\numberline {8.1}3-Operand Form}{10}}
     98\@writefile{toc}{\contentsline {subsection}{\numberline {8.2}256-bit Operations}{10}}
     99\@writefile{toc}{\contentsline {subsection}{\numberline {8.3}Performance Results}{10}}
     100\@writefile{lof}{\contentsline {figure}{\numberline {13}{\ignorespaces Parabix Performance (y-axis: ns per kB)\relax }}{10}}
     101\newlabel{avx}{{13}{10}}
     102\@writefile{toc}{\contentsline {section}{\numberline {9}Multithreaded Parabix}{10}}
     103\newlabel{section:multithread}{{9}{10}}
    98104\citation{dataparallel}
    99105\citation{Shah:2009}
    100 \@writefile{lof}{\contentsline {figure}{\numberline {12}{\ignorespaces Parabix Instruction Counts (y-axis: Instructions per kB)\relax }}{10}}
    101 \newlabel{insmix}{{12}{10}}
    102 \@writefile{toc}{\contentsline {section}{\numberline {9}Multithreaded Parabix}{10}}
    103 \newlabel{section:multithread}{{9}{10}}
    104 \@writefile{lof}{\contentsline {figure}{\numberline {13}{\ignorespaces Parabix Performance (y-axis: ns per kB)\relax }}{10}}
    105 \newlabel{avx}{{13}{10}}
    106 \@writefile{lot}{\contentsline {table}{\numberline {5}{\ignorespaces Stage Division\relax }}{10}}
    107 \newlabel{pass_structure}{{5}{10}}
    108106\citation{DaiNiZhu2010}
    109107\citation{NicolaJohn03}
     
    117115\citation{cameron-EuroPar2011}
    118116\citation{CameronLin2009}
     117\@writefile{lot}{\contentsline {table}{\numberline {5}{\ignorespaces Stage Division\relax }}{11}}
     118\newlabel{pass_structure}{{5}{11}}
     119\@writefile{lof}{\contentsline {figure}{\numberline {14}{\ignorespaces Average Statistic of Multithreaded Parabix\relax }}{11}}
     120\newlabel{multithread_perf}{{14}{11}}
    119121\@writefile{toc}{\contentsline {section}{\numberline {10}Related Work}{11}}
    120122\newlabel{section:related}{{10}{11}}
    121 \@writefile{lof}{\contentsline {figure}{\numberline {14}{\ignorespaces Average Statistic of Multithreaded Parabix\relax }}{11}}
    122 \newlabel{multithread_perf}{{14}{11}}
    123 \@writefile{toc}{\contentsline {section}{\numberline {11}Conclusion}{11}}
    124 \newlabel{section:conclusion}{{11}{11}}
    125123\bibstyle{ieee/latex8}
    126124\bibdata{reference}
     
    144142\bibcite{NicolaJohn03}{18}
    145143\bibcite{JMBE:31@99}{19}
     144\@writefile{toc}{\contentsline {section}{\numberline {11}Conclusion}{12}}
     145\newlabel{section:conclusion}{{11}{12}}
    146146\bibcite{ParaDOM2009}{20}
    147147\bibcite{Shah:2009}{21}
  • docs/HPCA2012/final_ieee/final.log

    r1768 r1774  
    1 This is pdfTeX, Version 3.1415926-1.40.10 (TeX Live 2009/Debian) (format=pdflatex 2011.10.18)  8 DEC 2011 12:16
     1This is pdfTeX, Version 3.1415926-1.40.10 (TeX Live 2009/Debian) (format=pdflatex 2011.4.5)  13 DEC 2011 16:49
    22entering extended mode
    33 %&-line parsing enabled.
     
    66LaTeX2e <2009/09/24>
    77Babel <v3.8l> and hyphenation patterns for english, usenglishmax, dumylang, noh
    8 yphenation, farsi, arabic, croatian, bulgarian, ukrainian, russian, czech, slov
    9 ak, danish, dutch, finnish, french, basque, ngerman, german, german-x-2009-06-1
    10 9, ngerman-x-2009-06-19, ibycus, monogreek, greek, ancientgreek, hungarian, san
    11 skrit, italian, latin, latvian, lithuanian, mongolian2a, mongolian, bokmal, nyn
    12 orsk, romanian, irish, coptic, serbian, turkish, welsh, esperanto, uppersorbian
    13 , estonian, indonesian, interlingua, icelandic, kurmanji, slovenian, polish, po
    14 rtuguese, spanish, galician, catalan, swedish, ukenglish, pinyin, loaded.
     8yphenation, loaded.
    159(./preamble-final-ieee.tex (/usr/share/texmf-texlive/tex/latex/base/article.cls
    1610Document Class: article 2007/10/19 v1.4h Standard LaTeX document class
     
    438432(Font)              Font shape `OT1/pcr/b/n' tried instead on input line 41.
    439433[3]
    440 Overfull \hbox (9.88208pt too wide) in paragraph at lines 228--238
    441  []
    442  []
    443 
    444 
    445 Overfull \hbox (15.88206pt too wide) in paragraph at lines 249--277
     434Overfull \hbox (9.88208pt too wide) in paragraph at lines 227--237
     435 []
     436 []
     437
     438
     439Overfull \hbox (15.88206pt too wide) in paragraph at lines 248--276
    446440 []
    447441 []
     
    458452
    459453[5]
    460 Underfull \hbox (badness 1286) in paragraph at lines 101--114
     454Underfull \hbox (badness 1286) in paragraph at lines 100--113
    461455[] \OT1/ptm/b/n/10 En-ergy Mea-sure-ment:[] \OT1/ptm/m/n/10 A key ben-e-fit of
    462456the Para-bix
     
    480474 []
    481475
    482 <plots/corei3_TOT.pdf, id=70, 457.71pt x 209.78375pt>
     476[7 <./plots/corei3_BM.pdf>]
     477<plots/corei3_TOT.pdf, id=88, 457.71pt x 209.78375pt>
    483478File: plots/corei3_TOT.pdf Graphic file (type pdf)
    484479
    485480<use plots/corei3_TOT.pdf>
    486 <plots/corei3_energy.pdf, id=72, 454.69875pt x 203.76125pt>
     481<plots/corei3_energy.pdf, id=90, 454.69875pt x 203.76125pt>
    487482File: plots/corei3_energy.pdf Graphic file (type pdf)
    488483
    489 <use plots/corei3_energy.pdf>) (./06-scalability.tex [7 <./plots/corei3_BM.pdf>
    490 ] <plots/Parabix2_all_platform.pdf, id=92, 432.61626pt x 263.98625pt>
     484<use plots/corei3_energy.pdf>) (./06-scalability.tex
     485<plots/Parabix2_all_platform.pdf, id=92, 432.61626pt x 263.98625pt>
    491486File: plots/Parabix2_all_platform.pdf Graphic file (type pdf)
    492487
    493488<use plots/Parabix2_all_platform.pdf>
    494 Overfull \hbox (7.22688pt too wide) in paragraph at lines 37--39
     489Overfull \hbox (7.22688pt too wide) in paragraph at lines 33--35
    495490 []
    496491 []
     
    506501File: plots/Markup_density_Intel.pdf Graphic file (type pdf)
    507502
    508 <use plots/Markup_density_Intel.pdf> [8 <./plots/corei3_TOT.pdf> <./plots/corei
    509 3_energy.pdf> <./plots/Parabix2_all_platform.pdf>]
    510 <plots/InsMix.pdf, id=155, 744.7825pt x 261.97874pt>
     503<use plots/Markup_density_Intel.pdf>
     504Underfull \hbox (badness 1210) in paragraph at lines 77--88
     505\OT1/ptm/m/n/10 1Ghz single-core, dual-issue, su-per-scalar mi-cro-pro-ces-sor.
     506
     507 []
     508
     509<plots/InsMix.pdf, id=99, 744.7825pt x 261.97874pt>
    511510File: plots/InsMix.pdf Graphic file (type pdf)
    512511 <use plots/InsMix.pdf>)
    513 (./07-avx.tex [9 <./plots/arm_TOT.pdf> <./plots/Markup_density_Arm.pdf> <./plot
    514 s/Markup_density_Intel.pdf>] <plots/avx.pdf, id=186, 424.58624pt x 212.795pt>
     512(./07-avx.tex [8 <./plots/Parabix2_all_platform.pdf>] [9 <./plots/corei3_TOT.pd
     513f> <./plots/corei3_energy.pdf> <./plots/arm_TOT.pdf> <./plots/Markup_density_Ar
     514m.pdf> <./plots/Markup_density_Intel.pdf> <./plots/InsMix.pdf>]
     515<plots/avx.pdf, id=200, 424.58624pt x 212.795pt>
    515516File: plots/avx.pdf Graphic file (type pdf)
    516 
    517 <use plots/avx.pdf>
     517 <use plots/avx.pdf>
    518518Overfull \hbox (7.22688pt too wide) in paragraph at lines 104--105
    519519 []
    520520 []
    521521
    522 ) (./09-pipeline.tex [10 <./plots/InsMix.pdf> <./plots/avx.pdf>]
    523 Underfull \hbox (badness 1072) in paragraph at lines 76--85
     522) (./09-pipeline.tex [10 <./plots/avx.pdf>]
     523Underfull \hbox (badness 1072) in paragraph at lines 75--84
    524524[]\OT1/ptm/m/n/10 Figure 14[] demon-strates the per-for-mance im-prove-ment
    525525 []
     
    528528File: plots/pipeline.pdf Graphic file (type pdf)
    529529 <use plots/pipeline.pdf>
    530 Overfull \hbox (7.22688pt too wide) in paragraph at lines 99--101
    531  []
    532  []
    533 
    534 ) (./10-related.tex) (./11-conclusions.tex [11 <./plots/pipeline.pdf>])
     530Overfull \hbox (7.22688pt too wide) in paragraph at lines 98--100
     531 []
     532 []
     533
     534) (./10-related.tex [11 <./plots/pipeline.pdf>]) (./11-conclusions.tex)
    535535(./final.bbl
    536536Underfull \hbox (badness 1137) in paragraph at lines 17--22
     
    553553 []
    554554
     555[12]
    555556Missing character: There is no à in font ptmr7t!
    556557Missing character: There is no š in font ptmr7t!
    557 ) [12] (./final.aux) )
     558) [13
     559
     560] (./final.aux) )
    558561Here is how much of TeX's memory you used:
    559  3934 strings out of 493848
    560  54935 string characters out of 1152822
    561  119286 words of memory out of 3000000
    562  7039 multiletter control sequences out of 15000+50000
     562 3934 strings out of 495061
     563 54935 string characters out of 1182622
     564 118305 words of memory out of 3000000
     565 6940 multiletter control sequences out of 15000+50000
    563566 69892 words of font info for 168 fonts, out of 3000000 for 9000
    564  717 hyphenation exceptions out of 8191
     567 31 hyphenation exceptions out of 8191
    565568 38i,12n,38p,1456b,370s stack positions out of 5000i,500n,10000p,200000b,50000s
    566 {/usr/share/texmf-texlive/fonts/enc/dvips/base/8r.enc}</u
    567 sr/share/texmf-texlive/fonts/type1/public/amsfonts/cm/cmmi10.pfb></usr/share/te
    568 xmf-texlive/fonts/type1/public/amsfonts/cm/cmr10.pfb></usr/share/texmf-texlive/
    569 fonts/type1/public/amsfonts/cm/cmsy10.pfb></usr/share/texmf-texlive/fonts/type1
    570 /public/amsfonts/cm/cmtt10.pfb></usr/share/texmf-texlive/fonts/type1/public/ams
    571 fonts/cm/cmtt8.pfb></usr/share/texmf-texlive/fonts/type1/urw/courier/ucrb8a.pfb
    572 ></usr/share/texmf-texlive/fonts/type1/urw/courier/ucrr8a.pfb></usr/share/texmf
    573 -texlive/fonts/type1/urw/symbol/usyr.pfb></usr/share/texmf-texlive/fonts/type1/
    574 urw/symbol/usyr.pfb></usr/share/texmf-texlive/fonts/type1/urw/times/utmb8a.pfb>
    575 </usr/share/texmf-texlive/fonts/type1/urw/times/utmr8a.pfb></usr/share/texmf-te
    576 xlive/fonts/type1/urw/times/utmri8a.pfb>
    577 Output written on final.pdf (12 pages, 517924 bytes).
     569{/usr/share/texmf-texlive/fonts/enc/dvips/base/8r.enc}</usr/sh
     570are/texmf-texlive/fonts/type1/public/amsfonts/cm/cmmi10.pfb></usr/share/texmf-t
     571exlive/fonts/type1/public/amsfonts/cm/cmr10.pfb></usr/share/texmf-texlive/fonts
     572/type1/public/amsfonts/cm/cmsy10.pfb></usr/share/texmf-texlive/fonts/type1/publ
     573ic/amsfonts/cm/cmtt10.pfb></usr/share/texmf-texlive/fonts/type1/public/amsfonts
     574/cm/cmtt8.pfb></usr/share/texmf-texlive/fonts/type1/urw/courier/ucrb8a.pfb></us
     575r/share/texmf-texlive/fonts/type1/urw/courier/ucrr8a.pfb></usr/share/texmf-texl
     576ive/fonts/type1/urw/symbol/usyr.pfb></usr/share/texmf-texlive/fonts/type1/urw/s
     577ymbol/usyr.pfb></usr/share/texmf-texlive/fonts/type1/urw/times/utmb8a.pfb></usr
     578/share/texmf-texlive/fonts/type1/urw/times/utmr8a.pfb></usr/share/texmf-texlive
     579/fonts/type1/urw/times/utmri8a.pfb>
     580Output written on final.pdf (13 pages, 518284 bytes).
    578581PDF statistics:
    579  275 PDF objects out of 1000 (max. 8388607)
     582 279 PDF objects out of 1000 (max. 8388607)
    580583 0 named destinations out of 1000 (max. 500000)
    581584 61 words of extra memory for PDF output out of 10000 (max. 10000000)