Changeset 1326 for docs


Timestamp:
Aug 19, 2011, 4:57:57 PM
Author:
ashriram
Message:

New Intro New title

Location:
docs/HPCA2011
Files:
12 added
8 edited

Legend:

Unmodified
Added
Removed
  • docs/HPCA2011/01-intro.tex

    r1302 r1326  
    11\section{Introduction}
    2 
    3 Extensible Markup Language (XML) is a core technology standard
    4 of the World Wide Web Consortium (W3C) that provides a common
    5 framework for encoding and communicating structured information. 
    6 In applications ranging from Office Open XML in
    7 Microsoft Office to NDFD XML of the NOAA National Weather
    8 Service, from KML in Google Earth to Castor XML in the Martian Rovers,
    9 from ebXML for e-commerce data interchange to RSS for news feeds
    10 from web sites everywhere, XML plays a ubiquitous role in providing
    11 a common framework for data interoperability world-wide and beyond.
    12 As XML 1.0 editor Tim Bray is quoted in the W3C celebration of XML at 10 years,
    13 "there is essentially no computer in the world, desk-top, hand-held,
    14 or back-room, that doesn't process XML sometimes."
      2Classical Dennard scaling~\cite{}, which ensured that voltage scaling
      3would let us keep all of the transistors afforded by Moore's law
      4active, has come to an end. This has already forced a rethink of the
      5way general-purpose processors are built: processor frequencies have
      6remained stagnant over the last five years, and Intel multicores
      7provide the capability to boost the speed of individual cores only
      8when other cores on the chip are shut off. Chip makers strive for
      9energy-efficient computing by operating cores at more optimal
      10frequencies and seeking additional performance from a larger number
      11of cores. Unfortunately, given the limited levels of
      12parallelism~\cite{blake-isca-2010} that multicores can exploit in
      13applications, it is not certain how far we can continue scaling the
      14number of cores per chip~\cite{esmaeilzadeh-isca-2011}, because
      15exploiting parallelism across multiple cores tends to require
      16heavyweight threads that are difficult to manage and synchronize.
     17
     18
      19The desire to improve the overall efficiency of computing is pushing
      20designers to explore customized hardware~\cite{venkatesh-asplos-2010,
      21  hameed-isca-2010} that accelerates specific parts of an application
      22while reducing the overheads present in general-purpose
      23processors. They seek to exploit the transistor bounty to provision
      24many different accelerators and to keep only the accelerators needed
      25by an application active while switching off the others on the chip
      26to save power. While promising, given the fast evolution of languages
      27and software, it is hard to define a set of fixed-function hardware
      28units for commodity processors. Furthermore, the toolchain needed to
      29create such customized hardware is itself a hard research
      30challenge. We believe that software, applications, and runtime models
      31themselves can be refactored to significantly improve the overall
      32computing efficiency of commodity processors.
     33
     34
      35In this paper, we demonstrate with an XML parser that changes to the
      36underlying algorithm and compute model can significantly improve
      37efficiency on commodity processors. We achieve this efficiency by
      38carefully redesigning the algorithm around the Parallel Bit Stream
      39runtime framework (Parabix), which exploits the SIMD extensions
      40(SSE/AVX on x86, Neon on ARM) of commodity processors. The Parabix
      41framework uses modern instructions in the processor ISA that can
      42execute tens of operations (on multiple character streams) in a
      43single instruction, amortizing the overheads of the general-purpose
      44processor. Parabix also minimizes, or eliminates entirely, branches,
      45resulting in a more efficient pipeline, and improves overall
      46register/cache utilization, which minimizes energy wasted on data
      47transfers. The Parabix SSE/AVX code also uses sophisticated
      48instructions to pack and unpack data elements in registers, which
      49makes the overall cache access behavior of the application regular,
      50resulting in significantly fewer misses and better utilization.
      51Overall, as summarized in Figure~\ref{perf-energy}, our Parabix-based
      52XML parser improves performance by ?$\times$ and energy efficiency by
      53?$\times$ compared to widely-used software parsers, approaching the
      54?~cycles/input-byte performance of ASIC XML
      55parsers~\cite{}.\footnote{The actual energy consumption of the XML
      56  ASIC chips is not published by the companies.}
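As a concrete illustration of the parallel bit stream approach (a minimal hand-written sketch, not code generated by the Parabix framework, with the basis-bit numbering chosen here only for exposition), once the input bytes have been transposed into eight basis bit streams, a character-class stream such as the occurrences of the opening angle bracket (0x3C) can be computed for 128 input positions at a time with a handful of branch-free SSE2 operations:

\begin{verbatim}
#include <emmintrin.h>  /* SSE2 */

/* basis[k] holds bit k of every byte in a 128-byte block, with basis[0]
   the most significant bit.  Returns a bit stream marking every
   position whose byte is '<' (0x3C = 00111100).                        */
static inline __m128i match_open_angle(const __m128i basis[8]) {
    __m128i ones = _mm_set1_epi8((char)0xFF);
    __m128i hi   = _mm_andnot_si128(basis[0],
                     _mm_andnot_si128(basis[1], basis[2]));   /* ~b0 & ~b1 & b2 */
    __m128i mid  = _mm_and_si128(basis[3],
                     _mm_and_si128(basis[4], basis[5]));      /*  b3 &  b4 & b5 */
    __m128i lo   = _mm_andnot_si128(basis[6],
                     _mm_andnot_si128(basis[7], ones));       /* ~b6 & ~b7      */
    return _mm_and_si128(hi, _mm_and_si128(mid, lo));
}
\end{verbatim}

Further lexical item streams are formed by the same kind of register-wide boolean algebra, with no per-byte branching.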
     58
     59
      60XML is a particularly interesting application; it is a standard of
      61the World Wide Web Consortium (W3C) that provides a common framework
      62for encoding and communicating data.  XML provides critical data
      63storage for applications ranging from Office Open XML in Microsoft
      64Office to NDFD XML of the NOAA National Weather Service, from KML in
      65Google Earth to Castor XML in the Martian Rovers, and XML data in
      66Android phones.  XML parsing efficiency is important for multiple
      67application areas; in server workloads the key focus is on overall
      68transactions per second, while in network switches and cell phones
      69the latency and energy cost of parsing are of paramount
      70importance. Software-based XML parsers are particularly inefficient;
      71they consist of giant \textit{switch-case} statements, which waste
      72processor resources since they introduce input-data-dependent
      73branches. They also have poor cache efficiency since they sift
      74forward and backward through the input-data stream trying to match
      75the parsed tags.  XML ASIC chips have been around for over 6 years,
      76but typically lag behind CPUs in technology due to cost
      77constraints. Our focus is on how much we can improve the performance
      78of XML parsing on commodity processors with Parabix technology.
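For contrast, the following schematic inner loop (a hypothetical example, not taken from Expat, Xerces, or any particular parser) shows the byte-at-a-time model such parsers follow: every iteration dispatches on the next input byte, so the branch taken depends directly on the data and is hard to predict:

\begin{verbatim}
#include <stddef.h>

typedef enum { CONTENT, IN_TAG, IN_REF } State;

static void scan(const unsigned char *buf, size_t len) {
    State s = CONTENT;
    for (size_t i = 0; i < len; i++) {
        switch (buf[i]) {       /* input-data-dependent branch on every byte */
        case '<': s = IN_TAG;  break;   /* start of a tag                */
        case '>': s = CONTENT; break;   /* end of a tag                  */
        case '&': s = IN_REF;  break;   /* start of an entity reference  */
        default:               break;   /* accumulate character data     */
        }
    }
    (void)s;                    /* a real parser would act on the state  */
}
\end{verbatim}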
     79
      80Overall, we make the following contributions in this paper.
      81
      821) We develop an XML parser that demonstrates the impact of
      83redesigning the core of an application to make more efficient use of
      84commodity processors. We compare the Parabix XML parser against
      85conventional parsers and demonstrate the improvement in overall
      86performance and energy efficiency. We also parallelize the Parabix
      87XML parser to enable the different stages of the parser to exploit
      88the SIMD units across all the cores. This further improves
      89performance while keeping energy consumption on par with the
      90sequential version.
     91
      922) We are the first to compare and contrast the SSE/AVX extensions
      93across multiple generations of Intel processors and show that there
      94are performance challenges when using newer-generation SIMD
      95extensions, possibly due to their memory interface. We also compare
      96ARM's Neon against x86's SIMD extensions and comment on the latency
      97of SIMD operations across these architectures.
     98
      993) Finally, we introduce a runtime framework, \textit{Parabix}, that
      100abstracts away the SIMD specifics of the machine (e.g., register
      101widths) and provides a language framework that enables applications
      102to run efficiently on commodity processors. Parabix enables
      103general-purpose multicores to be used efficiently by an entirely new
      104class of applications: text processing and parsing.
     105
     106
     107
     108
     109
     110
     111\begin{comment}
     112Figure~\ref{perf-energy} is an energy-performance scatter plot showing
     113the results obtained.
     114
    15115
    16116With all this XML processing, a substantial literature has arisen
    17 addressing XML processing performance in general and the
    18 performance of XML parsers in particular.   Nicola and John
    19 specifically identified XML parsing as a threat to database
    20  performance and outlined a number of potential directions for potential
    21 performance improvements \cite{NicolaJohn03}.  The nature of XML
    22 APIs was found to have a significant affect on performance with
    23 event-based SAX (Simple API for XML) parsers avoiding the tree
    24 construction costs of the more flexible DOM (Document Object
    25 Model) parsers \cite{Perkins05}.  The commercial importance
    26 of XML parsing spurred developments of hardware-based approaches
    27 including the development of a custom XML chip \cite{Leventhal2009}
    28 as well as FPGA-based implementations \cite{DaiNiZhu2010}.
    29 However promising these approaches may be for particular niche applications,
    30 it is likely that the bulk of the world's XML
    31 processing workload will be carried out on commodity processors
    32 using software-based solutions.
     117addressing XML processing performance in general and the performance
     118of XML parsers in particular.  Nicola and John specifically identified
     119XML parsing as a threat to database performance and outlined a number
      120of potential directions for performance improvements
     121\cite{NicolaJohn03}.  The nature of XML APIs was found to have a
      122significant effect on performance, with event-based SAX (Simple API for
     123XML) parsers avoiding the tree construction costs of the more flexible
     124DOM (Document Object Model) parsers \cite{Perkins05}.  The commercial
     125importance of XML parsing spurred developments of hardware-based
     126approaches including the development of a custom XML chip
     127\cite{Leventhal2009} as well as FPGA-based implementations
     128\cite{DaiNiZhu2010}.  However promising these approaches may be for
     129particular niche applications, it is likely that the bulk of the
     130world's XML processing workload will be carried out on commodity
     131processors using software-based solutions.
    33132
    34133To accelerate XML parsing performance in software, most recent
     
    43142benefits over traditional sequential parsing techniques that follow the
    44143byte-at-a-time model.
    45 
    46 With this focus on performance however, relatively little attention
    47 has been paid on reducing energy consumption in XML processing.  For example, in addressing
    48 performance through multicore parallelism, one generally must
    49 pay an energy price for performance gains because of the
    50 increased processing required for synchronization.   
    51 This focus on reduction of energy consumption is a key topic in this
    52 paper. We study the energy and performance
    53 characteristics of several XML parsers across three generations
    54 of x86-64 processor technology.  The parsers we consider are
    55 the widely used byte-at-a-time parsers Expat and Xerces, as well the
    56 Parabix1 and Parabix2 parsers based on parallel bit stream technology. 
    57 A compelling result is that the performance benefits of parallel bit stream technology
    58 translate directly and proportionally to substantial energy savings.
    59 Figure \ref{perf-energy} is an energy-performance scatter plot
    60 showing the results obtained.
     144\end{comment}
     145
     146
    61147
    62148\begin{figure}
     
    69155
    70156The remainder of this paper is organized as follows.
    71 Section 2 presents background material on XML parsing
    72 and traditional parsing methods.  Section 3 reviews
    73 parallel bit stream technology as applied to
    74 XML parsing in the Parabix1 and Parabix2 parsers.
    75 Section 4 introduces our methodology and approach
    76 for the performance and energy study tackled in the
    77 remainder of the paper.  Section 5 presents a
    78 detailed performance evaluation on a \CITHREE\ processor
    79 as our primary evaluation platform, addressing a
    80 number of microarchitectural issues including cache
    81 misses, branch mispredictions, SIMD instruction counts
    82 and so forth.  Section 6 examines scalability and
    83 performance gains through three generations of Intel
    84 architecture culminating with a performance assessment
    85 on our two week-old \SB\ test machine.
    86 Section 7 looks specifically at issues in applying
    87 the new 256-bit AVX technology to parallel bit stream
    88 technology and notes that the major performance benefit
    89 seen so far results from the change to the non-destructive three-operand
    90 instruction format.  Section 8 concludes with
    91 a discussion of ongoing work and further research directions.
    92 
    93 
    94 %Traditional measures of performance fail to capture the impact of energy consumption \cite {bellosa2001}.
    95 %In a study done in 2007, it was estimated that in 2005, the annual operating cost\footnote{This figure only included the cost of server power consumption and cooling;
    96 %it did not account for the cost of network traffic, data storage, service and maintenance or system replacement.} of corporate servers
    97 %and data centers alone was over \$7.2 billion---with the expectation that this cost would increase to \$12.7 billion by 2010 \cite{koomey2007}.
    98 %But when it comes to power consumption, corporate costs are not the only concern: in the world of mobile devices, battery life is paramount.
    99 %While the capabilities and users' expectations of mobile devices has rapidly increased, little imp%rovement to battery technology itself is foreseen in the near future \cite{silven2007, walker2007}.
      157Section~\ref{background} presents background material on XML parsing
      158and provides insight into the inefficiency of traditional parsers on
      159mainstream processors.  Section~\ref{parallel-bitstream} reviews
      160parallel bit stream technology, a framework that exploits the
      161data-parallel SIMD extensions of modern processors.  Section 4
      162describes our methodology for the performance and energy study.
      163Section 5 presents a detailed performance evaluation on a
      164\CITHREE\ processor, our primary evaluation platform, addressing
      165microarchitectural issues including cache misses, branch
      166mispredictions, and SIMD instruction counts.  Section 6 examines
      167scalability and performance gains through three generations of Intel
      168architecture, culminating in a performance assessment on our
      169two-week-old \SB\ test machine. We also look at issues in applying the
      170new 256-bit AVX technology to parallel bit streams and note that the
      171major performance benefit seen so far results from the change to the
      172non-destructive three-operand instruction format.
     173
     174
     175
    100176
    101177%One area in which both servers and mobile devices devote considerable
  • docs/HPCA2011/09-pipeline.tex

    r1325 r1326  
    11\section{Multi-threaded Parabix}
    2 The general problem with addressing performance through multicore
    3 parallelism is the increased energy cost. As discussed in previous sections,
    4 Parabix, which applies SIMD-based techniques, not only achieves better performance but also consumes less energy.
    5 Moreover, using multiple cores, we can further improve the performance of Parabix while keeping energy consumption at the same level.
    6 
    7 A typical approach to parallelizing software, data parallelism, requires nearly independent data.
    8 However, the nature of XML files makes them hard to partition nicely for data parallelism.
    9 Several approaches have been used to address this problem.
    10 A preparsing phase has been proposed to help partition the XML document \cite{dataparallel}.
    11 The goal of this preparsing is to determine the tree structure of the XML document
    12 so that it can be used to guide the full parsing in the next phase.
    13 Another data-parallel algorithm is ParDOM \cite{Shah:2009}.
    14 It first builds partial DOM node tree structures for each data segment and then links them,
    15 using preorder numbers assigned to each start element to determine the ordering among siblings
    16 and a stack to manage the parent-child relationships between elements.
    17 
    18 These data-parallel approaches introduce considerable overhead to resolve the data dependencies between segments.
    19 Therefore, instead of partitioning the data and assigning different data segments to different cores,
    20 we propose a pipeline parallelism strategy that partitions the process into several stages and lets each core work on a single stage.
    21 
    22 The interface between stages is implemented using a circular array,
    23 where each entry consists of all ten data structures for one segment, as listed in Table \ref{pass_structure}.
    24 Each thread keeps an index into the array ($I_N$),
    25 which is compared with the index ($I_{N-1}$) kept by the previous thread before processing a segment.
    26 If $I_N$ is smaller than $I_{N-1}$, thread $N$ can start processing segment $I_N$;
    27 otherwise, the thread keeps re-reading $I_{N-1}$ until $I_{N-1}$ is larger than $I_N$.
    28 The time consumed by repeatedly loading the value of $I_{N-1}$ and
    29 comparing it with $I_N$ is referred to later as stall time.
    30 When a thread finishes processing a segment, it increments its index by one, as sketched below.
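The following is a minimal C sketch of this hand-off (an illustration only, not the actual Parabix thread code), assuming C11 atomics; the \texttt{Segment} placeholder and the \texttt{process\_stage} stub stand in for the per-segment data structures and pass functions of Table \ref{pass_structure}, and the check that the first stage does not overrun the ring capacity is omitted:

\begin{verbatim}
#include <stdatomic.h>

enum { NUM_STAGES = 4, SEGMENTS = 16 };

/* Placeholder for the ten per-segment data structures. */
typedef struct { char data[4096]; } Segment;

static Segment ring[SEGMENTS];               /* circular array shared by stages */
static _Atomic long stage_index[NUM_STAGES]; /* I_N: next segment for stage N   */

static void process_stage(int stage, Segment *seg) { (void)stage; (void)seg; }

void stage_worker(int n) {
    for (;;) {
        long my = atomic_load(&stage_index[n]);
        /* Stall: spin until the previous stage has finished this segment. */
        while (n > 0 && my >= atomic_load(&stage_index[n - 1]))
            ;
        process_stage(n, &ring[my % SEGMENTS]);
        atomic_store(&stage_index[n], my + 1);   /* segment done: advance I_N */
    }
}
\end{verbatim}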
    312
    323\begin{table*}[t]
     
    345\begin{tabular}{|c|c|c|c|c|c|c|c|c|c|c|c|}
    356\hline
    36        & & \multicolumn{10}{|c|}{Data Structures}\\ \hline
    37        &                & srcbuf & basis\_bits & u8   & lex   & scope & ctCDPI & ref    & tag    & xml\_names & check\_streams\\ \hline
    38 Stage1 &fill\_buffer    & write  &             &      &       &       &        &        &        &            &               \\
    39        &s2p             & read   & write       &      &       &       &        &        &        &            &               \\
    40        &classify\_bytes &        & read        &      & write &       &        &        &        &            &               \\ \hline
    41 Stage2 &validate\_u8    &        & read        & write&       &       &        &        &        &            &               \\
    42        &gen\_scope      &        &             &      & read  & write &        &        &        &            &               \\
    43        &parse\_CtCDPI   &        &             &      & read  & read  & write  &        &        &            & write         \\
    44        &parse\_ref      &        &             &      & read  & read  & read   & write  &        &            &               \\ \hline
    45 Stage3 &parse\_tag      &        &             &      & read  & read  & read   &        & write  &            &               \\
    46        &validate\_name  &        &             & read & read  &       & read   & read   & read   & write      & write         \\
    47        &gen\_check      &        &             & read & read  & read  & read   &        & read   & read       & write         \\ \hline
    48 Stage4 &postprocessing  & read   &             &      & read  &       & read   & read   &        &            & read          \\ \hline
     7Stage Name & \multicolumn{10}{|c|}{Data Structures}\\ \hline
     8                & srcbuf & basis\_bits & u8   & lex   & scope & ctCDPI & ref    & tag    & xml\_names & check\_streams\\ \hline
     9fill\_buffer    & write  &             &      &       &       &        &        &        &            &               \\ \hline
     10s2p             & read   & write       &      &       &       &        &        &        &            &               \\ \hline
     11classify\_bytes &        & read        &      & write &       &        &        &        &            &               \\ \hline
     12validate\_u8    &        & read        & write&       &       &        &        &        &            &               \\ \hline
     13gen\_scope      &        &             &      & read  & write &        &        &        &            &               \\ \hline
     14parse\_CtCDPI   &        &             &      & read  & read  & write  &        &        &            & write         \\ \hline
     15parse\_ref      &        &             &      & read  & read  & read   & write  &        &            &               \\ \hline
     16parse\_tag      &        &             &      & read  & read  & read   &        & write  &            &               \\ \hline
     17validate\_name  &        &             & read & read  &       & read   & read   & read   & write      & write         \\ \hline
     18gen\_check      &        &             & read & read  & read  & read   &        & read   & read       & write         \\ \hline
     19postprocessing  & read   &             &      & read  &       & read   & read   &        &            & read          \\ \hline
    4920\end{tabular}
    5021\end{center}
     
    5324\end{table*}
    5425
    55 Figure \ref{multithread_perf} demonstrates the XML well-formedness checking performance of
    56 the multi-threaded Parabix in comparison with the single-threaded version.
    57 The multi-threaded Parabix is more than two times faster and runs at 2.7 cycles per input byte on the \SB{} machine.
    5826
    5927\begin{figure}
     
    6230\end{center}
    6331\caption{Processing Time (y axis: CPU cycles per byte)}
    64 \label{multithread_perf}
     32\label{perf}
    6533\end{figure}
    66 
    67 Figure \ref{power} shows the average power consumed by the multi-threaded Parabix in comparison with the single-threaded version.
    68 By running four threads and using all the cores at the same time, the multi-threaded Parabix draws considerably more power
    69 than the single-threaded version. However, the energy consumption is about the same, because the multi-threaded Parabix needs less processing time.
    70 In fact, as shown in Figure \ref{energy}, parsing soap.xml using multi-threaded Parabix consumes less energy than using single-threaded Parabix.
    7134
    7235\begin{figure}
    7336\begin{center}
    74 \includegraphics[width=0.5\textwidth]{plots/power.pdf}
     37\includegraphics[width=0.5\textwidth]{plots/perf_energy.pdf}
    7538\end{center}
    76 \caption{Average Power (watts)}
    77 \label{power}
    78 \end{figure}
    79 \begin{figure}
    80 \begin{center}
    81 \includegraphics[width=0.5\textwidth]{plots/energy.pdf}
    82 \end{center}
    83 \caption{Energy Consumption (nJ per byte)}
    84 \label{energy}
     39\caption{Energy vs. Performance (x axis: bytes per cycle, y axis: nJ per byte)}
     40\label{perf_energy}
    8541\end{figure}
    8642
  • docs/HPCA2011/main.aux

    r1325 r1326  
    11\relax
    2 \citation{NicolaJohn03}
    3 \citation{Perkins05}
    4 \citation{Leventhal2009}
    5 \citation{DaiNiZhu2010}
    6 \citation{ZhangPanChiu09,ParaDOM2009,LiWangLiuLi2009}
    7 \citation{XMLSSE42}
    8 \citation{CameronHerdyLin2008,Cameron2009,Cameron2010}
    9 \@writefile{toc}{\contentsline {section}{\numberline {1}Introduction}{1}}
    10 \@writefile{lof}{\contentsline {figure}{\numberline {1}{\ignorespaces XML Parser Technology Energy vs. Performance\relax }}{1}}
    11 \providecommand*\caption@xref[2]{\@setref\relax\@undefined{#1}}
    12 \newlabel{perf-energy}{{1}{1}}
     2\ifx\hyper@anchor\@undefined
     3\global \let \oldcontentsline\contentsline
     4\gdef \contentsline#1#2#3#4{\oldcontentsline{#1}{#2}{#3}}
     5\global \let \oldnewlabel\newlabel
     6\gdef \newlabel#1#2{\newlabelxx{#1}#2}
     7\gdef \newlabelxx#1#2#3#4#5#6{\oldnewlabel{#1}{{#2}{#3}}}
     8\AtEndDocument{\let \contentsline\oldcontentsline
     9\let \newlabel\oldnewlabel}
     10\else
     11\global \let \hyper@last\relax
     12\fi
     13
     14\citation{}
     15\citation{blake-isca-2010}
     16\citation{esmaeilzadeh-isca-2011}
     17\citation{venkatesh-asplos-2010,hameed-isca-2010}
     18\@writefile{toc}{\contentsline {section}{\numberline {1}Introduction}{1}{section.1}}
     19\citation{}
     20\@writefile{lof}{\contentsline {figure}{\numberline {1}{\ignorespaces XML Parser Technology Energy vs. Performance}}{3}{figure.1}}
     21\newlabel{perf-energy}{{1}{3}{XML Parser Technology Energy vs. Performance\relax }{figure.1}{}}
    1322\citation{TR:XML}
    1423\citation{DuCharme04}
    1524\citation{TR:XML}
    1625\citation{Cameron2010}
     26\@writefile{toc}{\contentsline {section}{\numberline {2}Background}{4}{section.2}}
     27\newlabel{section:background}{{2}{4}{Background\relax }{section.2}{}}
     28\@writefile{toc}{\contentsline {subsection}{\numberline {2.1}XML}{4}{subsection.2.1}}
     29\@writefile{toc}{\contentsline {subsection}{\numberline {2.2}Traditional XML Parsers}{4}{subsection.2.2}}
    1730\citation{expat}
    1831\citation{xerces}
     
    2033\citation{ZhangPanChiu09}
    2134\citation{ZhangPanChiu09}
     35\@writefile{lof}{\contentsline {figure}{\numberline {2}{\ignorespaces Example XML Document}}{5}{figure.2}}
     36\newlabel{fig:sample_xml}{{2}{5}{Example XML Document\relax }{figure.2}{}}
     37\@writefile{toc}{\contentsline {subsection}{\numberline {2.3}Parallel XML Parsing}{5}{subsection.2.3}}
    2238\citation{Cameron2010}
    2339\citation{CameronHerdyLin2008}
    24 \@writefile{toc}{\contentsline {section}{\numberline {2}Background}{2}}
    25 \newlabel{section:background}{{2}{2}}
    26 \@writefile{toc}{\contentsline {subsection}{\numberline {2.1}XML}{2}}
    27 \@writefile{lof}{\contentsline {figure}{\numberline {2}{\ignorespaces Example XML Document\relax }}{2}}
    28 \newlabel{fig:sample_xml}{{2}{2}}
    29 \@writefile{toc}{\contentsline {subsection}{\numberline {2.2}Traditional XML Parsers}{2}}
    30 \@writefile{toc}{\contentsline {subsection}{\numberline {2.3}Parallel XML Parsing}{2}}
    31 \@writefile{toc}{\contentsline {section}{\numberline {3}Parabix}{2}}
    32 \newlabel{section:parabix}{{3}{2}}
    33 \@writefile{toc}{\contentsline {subsection}{\numberline {3.1}Parabix1}{2}}
     40\@writefile{toc}{\contentsline {section}{\numberline {3}Parabix}{6}{section.3}}
     41\newlabel{section:parabix}{{3}{6}{Parabix\relax }{section.3}{}}
     42\@writefile{toc}{\contentsline {subsection}{\numberline {3.1}Parabix1}{6}{subsection.3.1}}
    3443\citation{CameronHerdyLin2008,Herdy2008,Cameron2009}
     44\@writefile{lof}{\contentsline {figure}{\numberline {3}{\ignorespaces Example 8-bit ASCII Character Basis Bit Streams}}{7}{figure.3}}
     45\newlabel{fig:BitstreamsExample}{{3}{7}{Example 8-bit ASCII Character Basis Bit Streams\relax }{figure.3}{}}
    3546\citation{Cameron2010}
    36 \@writefile{lof}{\contentsline {figure}{\numberline {3}{\ignorespaces Example 8-bit ASCII Character Basis Bit Streams\relax }}{3}}
    37 \newlabel{fig:BitstreamsExample}{{3}{3}}
    38 \@writefile{toc}{\contentsline {subsection}{\numberline {3.2}Parabix2}{3}}
    39 \@writefile{lof}{\contentsline {figure}{\numberline {4}{\ignorespaces Parabix1 Start Tag Validation\relax }}{3}}
    40 \newlabel{fig:Parabix1StarttagExample}{{4}{3}}
    41 \@writefile{lof}{\contentsline {figure}{\numberline {5}{\ignorespaces Parabix2 Start Tag Validation\relax }}{3}}
    42 \newlabel{fig:Parabix2StarttagExample}{{5}{3}}
    43 \@writefile{toc}{\contentsline {subsection}{\numberline {3.3}Parallel Bit Stream Compilation}{3}}
     47\@writefile{lof}{\contentsline {figure}{\numberline {4}{\ignorespaces Parabix1 Start Tag Validation}}{8}{figure.4}}
     48\newlabel{fig:Parabix1StarttagExample}{{4}{8}{Parabix1 Start Tag Validation\relax }{figure.4}{}}
     49\@writefile{toc}{\contentsline {subsection}{\numberline {3.2}Parabix2}{8}{subsection.3.2}}
     50\@writefile{lof}{\contentsline {figure}{\numberline {5}{\ignorespaces Parabix2 Start Tag Validation}}{9}{figure.5}}
     51\newlabel{fig:Parabix2StarttagExample}{{5}{9}{Parabix2 Start Tag Validation\relax }{figure.5}{}}
     52\@writefile{toc}{\contentsline {subsection}{\numberline {3.3}Parallel Bit Stream Compilation}{9}{subsection.3.3}}
    4453\citation{bellosa2001,bertran2010,bircher2007}
    4554\citation{bellosa2001}
     
    5059\citation{xerces}
    5160\citation{expat}
    52 \@writefile{toc}{\contentsline {section}{\numberline {4}Methodology}{4}}
    53 \@writefile{toc}{\contentsline {subsection}{\numberline {4.1}Parsers}{4}}
    54 \newlabel{parsers}{{4.1}{4}}
    55 \@writefile{toc}{\contentsline {subsection}{\numberline {4.2}Workloads}{4}}
    56 \newlabel{workloads}{{4.2}{4}}
    57 \@writefile{toc}{\contentsline {subsection}{\numberline {4.3}Platform Hardware}{4}}
    58 \@writefile{toc}{\contentsline {paragraph}{Intel Core2{}}{4}}
    59 \@writefile{lot}{\contentsline {table}{\numberline {2}{\ignorespaces Core2{}\relax }}{4}}
    60 \newlabel{core2info}{{2}{4}}
    61 \@writefile{toc}{\contentsline {paragraph}{Intel Core-i3{}}{4}}
    62 \@writefile{lot}{\contentsline {table}{\numberline {3}{\ignorespaces Core-i3{}\relax }}{4}}
    63 \newlabel{i3info}{{3}{4}}
    64 \@writefile{toc}{\contentsline {paragraph}{Intel Core-i5{}}{4}}
    65 \@writefile{toc}{\contentsline {subsection}{\numberline {4.4}PMC Hardware Events}{4}}
    66 \newlabel{events}{{4.4}{4}}
     61\@writefile{toc}{\contentsline {section}{\numberline {4}Methodology}{10}{section.4}}
     62\@writefile{toc}{\contentsline {subsection}{\numberline {4.1}Parsers}{10}{subsection.4.1}}
     63\newlabel{parsers}{{4.1}{10}{Parsers\relax }{subsection.4.1}{}}
     64\@writefile{lot}{\contentsline {table}{\numberline {1}{\ignorespaces XML Document Characteristics}}{11}{table.1}}
     65\newlabel{XMLDocChars}{{1}{11}{XML Document Characteristics\relax }{table.1}{}}
     66\@writefile{toc}{\contentsline {subsection}{\numberline {4.2}Workloads}{11}{subsection.4.2}}
     67\newlabel{workloads}{{4.2}{11}{Workloads\relax }{subsection.4.2}{}}
     68\@writefile{toc}{\contentsline {subsection}{\numberline {4.3}Platform Hardware}{11}{subsection.4.3}}
     69\@writefile{toc}{\contentsline {paragraph}{Intel Core2{}}{11}{section*.1}}
     70\@writefile{toc}{\contentsline {paragraph}{Intel Core-i3{}}{11}{section*.2}}
     71\@writefile{lot}{\contentsline {table}{\numberline {2}{\ignorespaces Core2{}}}{12}{table.2}}
     72\newlabel{core2info}{{2}{12}{\CO {}\relax }{table.2}{}}
     73\@writefile{lot}{\contentsline {table}{\numberline {3}{\ignorespaces Core-i3{}}}{12}{table.3}}
     74\newlabel{i3info}{{3}{12}{\CITHREE {}\relax }{table.3}{}}
     75\@writefile{toc}{\contentsline {paragraph}{Intel Core-i5{}}{12}{section*.3}}
     76\@writefile{lot}{\contentsline {table}{\numberline {4}{\ignorespaces SandyBridge{}}}{12}{table.4}}
     77\newlabel{sandybridgeinfo}{{4}{12}{\SB {}\relax }{table.4}{}}
     78\@writefile{toc}{\contentsline {subsection}{\numberline {4.4}PMC Hardware Events}{12}{subsection.4.4}}
     79\newlabel{events}{{4.4}{12}{PMC Hardware Events\relax }{subsection.4.4}{}}
    6780\citation{clamp}
    68 \@writefile{lot}{\contentsline {table}{\numberline {1}{\ignorespaces XML Document Characteristics\relax }}{5}}
    69 \newlabel{XMLDocChars}{{1}{5}}
    70 \@writefile{lot}{\contentsline {table}{\numberline {4}{\ignorespaces SandyBridge{}\relax }}{5}}
    71 \newlabel{sandybridgeinfo}{{4}{5}}
    72 \@writefile{toc}{\contentsline {subsection}{\numberline {4.5}Energy Measurement}{5}}
    73 \@writefile{toc}{\contentsline {section}{\numberline {5}Baseline Evaluation on Core-i3{}}{5}}
    74 \@writefile{toc}{\contentsline {subsection}{\numberline {5.1}Cache behavior}{5}}
    75 \@writefile{toc}{\contentsline {subsection}{\numberline {5.2}Branch Mispredictions}{5}}
    76 \@writefile{lof}{\contentsline {figure}{\numberline {6}{\ignorespaces Core-i3\ --- L1 Data Cache Misses (y-axis: Cache Misses per kB)\relax }}{5}}
    77 \newlabel{corei3_L1DM}{{6}{5}}
    78 \@writefile{lof}{\contentsline {figure}{\numberline {7}{\ignorespaces Core-i3\ --- L2 Data Cache Misses (y-axis: Cache Misses per kB)\relax }}{5}}
    79 \newlabel{corei3_L2DM}{{7}{5}}
     81\@writefile{toc}{\contentsline {subsection}{\numberline {4.5}Energy Measurement}{13}{subsection.4.5}}
     82\@writefile{toc}{\contentsline {section}{\numberline {5}Baseline Evaluation on Core-i3{}}{13}{section.5}}
     83\@writefile{toc}{\contentsline {subsection}{\numberline {5.1}Cache behavior}{13}{subsection.5.1}}
     84\@writefile{toc}{\contentsline {subsection}{\numberline {5.2}Branch Mispredictions}{13}{subsection.5.2}}
     85\@writefile{lof}{\contentsline {figure}{\numberline {6}{\ignorespaces Core-i3\ --- L1 Data Cache Misses (y-axis: Cache Misses per kB)}}{14}{figure.6}}
     86\newlabel{corei3_L1DM}{{6}{14}{\CITHREE \ --- L1 Data Cache Misses (y-axis: Cache Misses per kB)\relax }{figure.6}{}}
     87\@writefile{lof}{\contentsline {figure}{\numberline {7}{\ignorespaces Core-i3\ --- L2 Data Cache Misses (y-axis: Cache Misses per kB)}}{14}{figure.7}}
     88\newlabel{corei3_L2DM}{{7}{14}{\CITHREE \ --- L2 Data Cache Misses (y-axis: Cache Misses per kB)\relax }{figure.7}{}}
     89\@writefile{lof}{\contentsline {figure}{\numberline {8}{\ignorespaces Core-i3\ --- L3 Cache Misses (y-axis: Cache Misses per kB)}}{14}{figure.8}}
     90\newlabel{corei3_L3TM}{{8}{14}{\CITHREE \ --- L3 Cache Misses (y-axis: Cache Misses per kB)\relax }{figure.8}{}}
     91\@writefile{lof}{\contentsline {figure}{\numberline {9}{\ignorespaces Core-i3\ --- Branch Instructions (y-axis: Branches per kB)}}{15}{figure.9}}
     92\newlabel{corei3_BR}{{9}{15}{\CITHREE \ --- Branch Instructions (y-axis: Branches per kB)\relax }{figure.9}{}}
    8093\citation{Cameron2008}
    81 \@writefile{lof}{\contentsline {figure}{\numberline {8}{\ignorespaces Core-i3\ --- L3 Cache Misses (y-axis: Cache Misses per kB)\relax }}{6}}
    82 \newlabel{corei3_L3TM}{{8}{6}}
    83 \@writefile{lof}{\contentsline {figure}{\numberline {9}{\ignorespaces Core-i3\ --- Branch Instructions (y-axis: Branches per kB)\relax }}{6}}
    84 \newlabel{corei3_BR}{{9}{6}}
    85 \@writefile{toc}{\contentsline {subsection}{\numberline {5.3}SIMD Instructions vs. Total Instructions}{6}}
    86 \@writefile{lof}{\contentsline {figure}{\numberline {10}{\ignorespaces Core-i3\ --- Branch Mispredictions (y-axis: Branch Mispredictions per kB)\relax }}{6}}
    87 \newlabel{corei3_BM}{{10}{6}}
    88 \@writefile{lof}{\contentsline {figure}{\numberline {11}{\ignorespaces Parabix1 --- SIMD vs. Non-SIMD Instructions (y-axis: Percent SIMD Instructions\relax }}{6}}
    89 \newlabel{corei3_INS_p1}{{11}{6}}
    90 \@writefile{toc}{\contentsline {subsection}{\numberline {5.4}CPU Cycles}{6}}
    91 \@writefile{lof}{\contentsline {figure}{\numberline {12}{\ignorespaces Parabix2 --- SIMD vs. Non-SIMD Instructions (y-axis: Percent SIMD Instructions)\relax }}{7}}
    92 \newlabel{corei3_INS_p2}{{12}{7}}
    93 \@writefile{lof}{\contentsline {figure}{\numberline {13}{\ignorespaces Core-i3\ --- Performance (y-axis: CPU Cycles per kB)\relax }}{7}}
    94 \newlabel{corei3_TOT}{{13}{7}}
    95 \@writefile{toc}{\contentsline {subsection}{\numberline {5.5}Power and Energy}{7}}
    96 \@writefile{toc}{\contentsline {section}{\numberline {6}Scalability}{7}}
    97 \@writefile{lof}{\contentsline {figure}{\numberline {14}{\ignorespaces Core-i3\ --- Average Power Consumption (watts)\relax }}{7}}
    98 \newlabel{corei3_power}{{14}{7}}
    99 \@writefile{lof}{\contentsline {figure}{\numberline {15}{\ignorespaces Core-i3\ --- Energy Consumption ($\mu $J per kB)\relax }}{7}}
    100 \newlabel{corei3_energy}{{15}{7}}
    101 \@writefile{toc}{\contentsline {subsection}{\numberline {6.1}Performance}{7}}
    102 \@writefile{lof}{\contentsline {figure}{\numberline {16}{\ignorespaces Average Performance Parabix vs. Expat (y-axis: CPU Cycles per kB)\relax }}{8}}
    103 \@writefile{lof}{\contentsline {subfigure}{\numberline{(a)}{\ignorespaces {Parabix2}}}{8}}
    104 \@writefile{lof}{\contentsline {subfigure}{\numberline{(b)}{\ignorespaces {Expat}}}{8}}
    105 \newlabel{Scalability}{{16}{8}}
    106 \@writefile{toc}{\contentsline {subsection}{\numberline {6.2}Power and Energy}{8}}
    107 \@writefile{toc}{\contentsline {section}{\numberline {7}Scaling Parabix2 for AVX}{8}}
    108 \@writefile{toc}{\contentsline {subsection}{\numberline {7.1}Three Operand Form}{8}}
    109 \@writefile{lof}{\contentsline {figure}{\numberline {17}{\ignorespaces Average Power of Parabix2 (watts)\relax }}{8}}
    110 \newlabel{power_Parabix2}{{17}{8}}
    111 \@writefile{lof}{\contentsline {figure}{\numberline {18}{\ignorespaces Energy consumption of Parabix2 (nJ/B)\relax }}{8}}
    112 \newlabel{energy_Parabix2}{{18}{8}}
    113 \@writefile{lof}{\contentsline {figure}{\numberline {20}{\ignorespaces Parabix2 Performance (y-axis: CPU Cycles per kB)\relax }}{8}}
    114 \newlabel{avx}{{20}{8}}
    115 \@writefile{lof}{\contentsline {figure}{\numberline {19}{\ignorespaces Parabix2 Instruction Counts (y-axis: Instructions per kB)\relax }}{9}}
    116 \newlabel{insmix}{{19}{9}}
    117 \@writefile{toc}{\contentsline {subsection}{\numberline {7.2}256-bit AVX Operations}{9}}
    118 \@writefile{toc}{\contentsline {subsection}{\numberline {7.3}Performance Results}{9}}
    119 \@writefile{toc}{\contentsline {section}{\numberline {8}Parabix2 on GT-P1000M}{9}}
    120 \citation{dataparallel}
    121 \citation{Shah:2009}
    122 \@writefile{lof}{\contentsline {figure}{\numberline {21}{\ignorespaces Parabix2 Performance on GT-P1000M (y-axis: CPU Cycles per kB)\relax }}{10}}
    123 \newlabel{arm_processing_time}{{21}{10}}
    124 \@writefile{toc}{\contentsline {subsection}{\numberline {8.1}Platform Hardware}{10}}
    125 \@writefile{lot}{\contentsline {table}{\numberline {5}{\ignorespaces GT-P1000M\relax }}{10}}
    126 \newlabel{arminfo}{{5}{10}}
    127 \@writefile{toc}{\contentsline {subsection}{\numberline {8.2}Performance Results}{10}}
    128 \@writefile{lof}{\contentsline {figure}{\numberline {22}{\ignorespaces Relative Slow Down of Parbix2 and Expat on GT-P1000M vs. Core-i3{} \relax }}{10}}
    129 \newlabel{relative_performance_arm_vs_i3}{{22}{10}}
    130 \@writefile{toc}{\contentsline {section}{\numberline {9}Multi-threaded Parabix}{10}}
     94\@writefile{lof}{\contentsline {figure}{\numberline {10}{\ignorespaces Core-i3\ --- Branch Mispredictions (y-axis: Branch Mispredictions per kB)}}{16}{figure.10}}
     95\newlabel{corei3_BM}{{10}{16}{\CITHREE \ --- Branch Mispredictions (y-axis: Branch Mispredictions per kB)\relax }{figure.10}{}}
     96\@writefile{toc}{\contentsline {subsection}{\numberline {5.3}SIMD Instructions vs. Total Instructions}{16}{subsection.5.3}}
     97\@writefile{toc}{\contentsline {subsection}{\numberline {5.4}CPU Cycles}{16}{subsection.5.4}}
     98\@writefile{lof}{\contentsline {figure}{\numberline {11}{\ignorespaces Parabix1 --- SIMD vs. Non-SIMD Instructions (y-axis: Percent SIMD Instructions}}{17}{figure.11}}
     99\newlabel{corei3_INS_p1}{{11}{17}{Parabix1 --- SIMD vs. Non-SIMD Instructions (y-axis: Percent SIMD Instructions\relax }{figure.11}{}}
     100\@writefile{lof}{\contentsline {figure}{\numberline {12}{\ignorespaces Parabix2 --- SIMD vs. Non-SIMD Instructions (y-axis: Percent SIMD Instructions)}}{17}{figure.12}}
     101\newlabel{corei3_INS_p2}{{12}{17}{Parabix2 --- SIMD vs. Non-SIMD Instructions (y-axis: Percent SIMD Instructions)\relax }{figure.12}{}}
     102\@writefile{toc}{\contentsline {subsection}{\numberline {5.5}Power and Energy}{17}{subsection.5.5}}
     103\@writefile{lof}{\contentsline {figure}{\numberline {13}{\ignorespaces Core-i3\ --- Performance (y-axis: CPU Cycles per kB)}}{18}{figure.13}}
     104\newlabel{corei3_TOT}{{13}{18}{\CITHREE \ --- Performance (y-axis: CPU Cycles per kB)\relax }{figure.13}{}}
     105\@writefile{lof}{\contentsline {figure}{\numberline {14}{\ignorespaces Core-i3\ --- Average Power Consumption (watts)}}{18}{figure.14}}
     106\newlabel{corei3_power}{{14}{18}{\CITHREE \ --- Average Power Consumption (watts)\relax }{figure.14}{}}
     107\@writefile{lof}{\contentsline {figure}{\numberline {15}{\ignorespaces Core-i3\ --- Energy Consumption ($\mu $J per kB)}}{19}{figure.15}}
     108\newlabel{corei3_energy}{{15}{19}{\CITHREE \ --- Energy Consumption ($\mu $J per kB)\relax }{figure.15}{}}
     109\@writefile{toc}{\contentsline {section}{\numberline {6}Scalability}{19}{section.6}}
     110\@writefile{toc}{\contentsline {subsection}{\numberline {6.1}Performance}{19}{subsection.6.1}}
     111\@writefile{lof}{\contentsline {figure}{\numberline {16}{\ignorespaces Average Performance Parabix vs. Expat (y-axis: CPU Cycles per kB)}}{20}{figure.16}}
     112\newlabel{Scalability}{{16}{20}{Average Performance Parabix vs. Expat (y-axis: CPU Cycles per kB)\relax }{figure.16}{}}
     113\@writefile{lof}{\contentsline {subfigure}{\numberline{(a)}{\ignorespaces {Parabix2}}}{20}{figure.16}}
     114\@writefile{lof}{\contentsline {subfigure}{\numberline{(b)}{\ignorespaces {Expat}}}{20}{figure.16}}
     115\@writefile{lof}{\contentsline {figure}{\numberline {17}{\ignorespaces Average Power of Parabix2 (watts)}}{20}{figure.17}}
     116\newlabel{power_Parabix2}{{17}{20}{Average Power of Parabix2 (watts)\relax }{figure.17}{}}
     117\@writefile{toc}{\contentsline {subsection}{\numberline {6.2}Power and Energy}{20}{subsection.6.2}}
     118\@writefile{lof}{\contentsline {figure}{\numberline {18}{\ignorespaces Energy consumption of Parabix2 (nJ/B)}}{21}{figure.18}}
     119\newlabel{energy_Parabix2}{{18}{21}{Energy consumption of Parabix2 (nJ/B)\relax }{figure.18}{}}
     120\@writefile{lof}{\contentsline {figure}{\numberline {19}{\ignorespaces Parabix2 Instruction Counts (y-axis: Instructions per kB)}}{21}{figure.19}}
     121\newlabel{insmix}{{19}{21}{Parabix2 Instruction Counts (y-axis: Instructions per kB)\relax }{figure.19}{}}
     122\@writefile{toc}{\contentsline {section}{\numberline {7}Scaling Parabix2 for AVX}{21}{section.7}}
     123\@writefile{toc}{\contentsline {subsection}{\numberline {7.1}Three Operand Form}{21}{subsection.7.1}}
     124\@writefile{lof}{\contentsline {figure}{\numberline {20}{\ignorespaces Parabix2 Performance (y-axis: CPU Cycles per kB)}}{22}{figure.20}}
     125\newlabel{avx}{{20}{22}{Parabix2 Performance (y-axis: CPU Cycles per kB)\relax }{figure.20}{}}
     126\@writefile{toc}{\contentsline {subsection}{\numberline {7.2}256-bit AVX Operations}{22}{subsection.7.2}}
     127\@writefile{toc}{\contentsline {subsection}{\numberline {7.3}Performance Results}{22}{subsection.7.3}}
     128\@writefile{toc}{\contentsline {section}{\numberline {8}Parabix2 on GT-P1000M}{24}{section.8}}
     129\@writefile{toc}{\contentsline {subsection}{\numberline {8.1}Platform Hardware}{24}{subsection.8.1}}
     130\@writefile{lot}{\contentsline {table}{\numberline {5}{\ignorespaces GT-P1000M}}{24}{table.5}}
     131\newlabel{arminfo}{{5}{24}{GT-P1000M\relax }{table.5}{}}
     132\@writefile{toc}{\contentsline {subsection}{\numberline {8.2}Performance Results}{24}{subsection.8.2}}
     133\@writefile{lof}{\contentsline {figure}{\numberline {21}{\ignorespaces Parabix2 Performance on GT-P1000M (y-axis: CPU Cycles per kB)}}{25}{figure.21}}
     134\newlabel{arm_processing_time}{{21}{25}{Parabix2 Performance on GT-P1000M (y-axis: CPU Cycles per kB)\relax }{figure.21}{}}
     135\@writefile{lof}{\contentsline {figure}{\numberline {22}{\ignorespaces Relative Slow Down of Parbix2 and Expat on GT-P1000M vs. Core-i3{} }}{26}{figure.22}}
     136\newlabel{relative_performance_arm_vs_i3}{{22}{26}{Relative Slow Down of Parbix2 and Expat on GT-P1000M vs. \CITHREE {} \relax }{figure.22}{}}
     137\@writefile{lot}{\contentsline {table}{\numberline {6}{\ignorespaces Relationship between Each Pass and Data Structures}}{26}{table.6}}
     138\newlabel{pass_structure}{{6}{26}{Relationship between Each Pass and Data Structures\relax }{table.6}{}}
     139\@writefile{lof}{\contentsline {figure}{\numberline {23}{\ignorespaces Processing Time (y axis: CPU cycles per byte)}}{26}{figure.23}}
     140\newlabel{perf}{{23}{26}{Processing Time (y axis: CPU cycles per byte)\relax }{figure.23}{}}
    131141\bibstyle{abbrv}
    132142\bibdata{reference}
    133143\bibcite{bellosa2001}{1}
    134144\bibcite{bertran2010}{2}
    135 \@writefile{lot}{\contentsline {table}{\numberline {6}{\ignorespaces Relationship between Each Pass and Data Structures\relax }}{11}}
    136 \newlabel{pass_structure}{{6}{11}}
    137 \@writefile{lof}{\contentsline {figure}{\numberline {23}{\ignorespaces Processing Time (y axis: CPU cycles per byte)\relax }}{11}}
    138 \newlabel{multithread_perf}{{23}{11}}
    139 \@writefile{lof}{\contentsline {figure}{\numberline {24}{\ignorespaces Average Power (watts)\relax }}{11}}
    140 \newlabel{power}{{24}{11}}
    141 \@writefile{lof}{\contentsline {figure}{\numberline {25}{\ignorespaces Energy Consumption (nJ per byte)\relax }}{11}}
    142 \newlabel{energy}{{25}{11}}
    143 \@writefile{toc}{\contentsline {section}{\numberline {10}Conclusion}{11}}
    144 \@writefile{toc}{\contentsline {section}{\numberline {11}References}{11}}
     145\@writefile{lof}{\contentsline {figure}{\numberline {24}{\ignorespaces Energy vs. Performance (x axis: bytes per cycle, y axis: nJ per byte)}}{27}{figure.24}}
     146\newlabel{perf_energy}{{24}{27}{Energy vs. Performance (x axis: bytes per cycle, y axis: nJ per byte)\relax }{figure.24}{}}
     147\@writefile{toc}{\contentsline {section}{\numberline {9}Multi-threaded Parabix}{27}{section.9}}
     148\@writefile{toc}{\contentsline {section}{\numberline {10}Conclusion}{27}{section.10}}
    145149\bibcite{bircher2007}{3}
    146 \bibcite{TR:XML}{4}
    147 \bibcite{Cameron2009}{5}
    148 \bibcite{Cameron2008}{6}
    149 \bibcite{Cameron2010}{7}
    150 \bibcite{CameronHerdyLin2008}{8}
    151 \bibcite{expat}{9}
    152 \bibcite{clamp}{10}
    153 \bibcite{DaiNiZhu2010}{11}
     150\bibcite{blake-isca-2010}{4}
     151\bibcite{TR:XML}{5}
     152\bibcite{Cameron2009}{6}
     153\bibcite{Cameron2008}{7}
     154\bibcite{Cameron2010}{8}
     155\bibcite{CameronHerdyLin2008}{9}
     156\bibcite{expat}{10}
     157\bibcite{clamp}{11}
    154158\bibcite{DuCharme04}{12}
    155 \bibcite{Perkins05}{13}
     159\bibcite{esmaeilzadeh-isca-2011}{13}
    156160\bibcite{Parabix1}{14}
    157161\bibcite{parabix2}{15}
    158162\bibcite{xerces}{16}
    159 \bibcite{Herdy2008}{17}
    160 \bibcite{XMLSSE42}{18}
    161 \bibcite{Leventhal2009}{19}
    162 \bibcite{LiWangLiuLi2009}{20}
    163 \bibcite{dataparallel}{21}
    164 \bibcite{NicolaJohn03}{22}
    165 \bibcite{ParaDOM2009}{23}
    166 \bibcite{Shah:2009}{24}
    167 \bibcite{ZhangPanChiu09}{25}
     163\bibcite{hameed-isca-2010}{17}
     164\bibcite{Herdy2008}{18}
     165\bibcite{venkatesh-asplos-2010}{19}
     166\bibcite{ZhangPanChiu09}{20}
  • docs/HPCA2011/main.bbl

    r1325 r1326  
    2020\newblock In {\em Performance Analysis of Systems Software, 2007. {ISPASS}
    2121  2007. {IEEE} International Symposium on}, pages 158 --168, Apr. 2007.
     22
     23\bibitem{blake-isca-2010}
     24G.~Blake, R.~G. Dreslinski, T.~Mudge, and K.~Flautner.
     25\newblock Evolution of thread-level parallelism in desktop applications.
     26\newblock In {\em Proceedings of the 37th annual international symposium on
     27  Computer architecture}, ISCA '10, 2010.
    2228
    2329\bibitem{TR:XML}
     
    6571\newblock {http://www.fluke.com/}.
    6672
    67 \bibitem{DaiNiZhu2010}
    68 Z.~Dai, N.~Ni, and J.~Zhu.
    69 \newblock A 1 cycle-per-byte {XML} parsing accelerator.
    70 \newblock In {\em FPGA '10: Proceedings of the 18th Annual {ACM/SIGDA}
    71   International Symposium on Field Programmable Gate Arrays}, pages 199--208,
    72   New York, NY, USA, 2010. ACM.
    73 
    7473\bibitem{DuCharme04}
    7574B.~DuCharme.
     
    7776\newblock In {\em {XML 2004}}, {Washington D.C.}, 2004.
    7877
    79 \bibitem{Perkins05}
    80 {E. Perkins and M. Kostoulas and A. Heifets and M. Matsa and N. Mendelsohn}.
    81 \newblock {Performance Analysis of {XML} APIs}.
    82 \newblock In {\em XML 2005}, Atlanta, Georgia, Nov. 2005.
     78\bibitem{esmaeilzadeh-isca-2011}
     79H.~Esmaeilzadeh, E.~Blem, R.~St.~Amant, K.~Sankaralingam, and D.~Burger.
     80\newblock Dark silicon and the end of multicore scaling.
     81\newblock In {\em Proceeding of the 38th annual international symposium on
     82  Computer architecture}, ISCA '11, 2011.
    8383
    8484\bibitem{Parabix1}
     
    9797\newblock {http://xerces.apache.org/xerces-c/}.
    9898
     99\bibitem{hameed-isca-2010}
     100R.~Hameed, W.~Qadeer, M.~Wachs, O.~Azizi, A.~Solomatnikov, B.~C. Lee,
     101  S.~Richardson, C.~Kozyrakis, and M.~Horowitz.
     102\newblock Understanding sources of inefficiency in general-purpose chips.
     103\newblock In {\em Proceedings of the 37th annual international symposium on
     104  Computer architecture}, ISCA '10, 2010.
     105
    99106\bibitem{Herdy2008}
    100107K.~S. Herdy, D.~S. Burggraf, and R.~D. Cameron.
     
    103110\newblock In {\em Proceedings of {SVG} Open 2008}, August 2008.
    104111
    105 \bibitem{XMLSSE42}
    106 Z.~Lei.
    107 \newblock {XML} parsing accelerator with {Intel} streaming {SIMD} extensions 4
    108   ({Intel} {SSE4}).
    109 \newblock
    110   {http://software.intel.com/en-us/articles/xml-parsing-accelerator-with-intel%
    111 -streaming-simd-extensions-4-intel-sse4/}, 2008.
    112 
    113 \bibitem{Leventhal2009}
    114 M.~Leventhal and E.~Lemoine.
    115 \newblock The {XML} chip at 6 years.
    116 \newblock In {\em International Symposium on Processing {XML} Efficiently:
    117   Overcoming Limits on Space, Time, or Bandwidth}, Aug. 2009.
    118 
    119 \bibitem{LiWangLiuLi2009}
    120 X.~Li, H.~Wang, T.~Liu, and W.~Li.
    121 \newblock Key elements tracing method for parallel {XML} parsing in multi-core
    122   system.
    123 \newblock {\em Parallel and Distributed Computing Applications and
    124   Technologies, International Conference on}, 0:439--444, 2009.
    125 
    126 \bibitem{dataparallel}
    127 W.~Lu, Y.~Pan, , and K.~Chiu.
    128 \newblock A parallel approach to xml parsing.
    129 \newblock {\em The 7th IEEE/ACM International Conference on Grid Computing},
    130   2006.
    131 
    132 \bibitem{NicolaJohn03}
    133 {Matthias Nicola and Jasmi John}.
    134 \newblock {XML Parsing: A Threat to Database Performance}.
    135 \newblock In {\em Proceedings of the Twelfth International Conference on
    136   Information and Knowledge Management}, New Orleans, Louisiana, 2003.
    137 
    138 \bibitem{ParaDOM2009}
    139 B.~Shah, P.~Rao, B.~Moon, and M.~Rajagopalan.
    140 \newblock A data parallel algorithm for {XML DOM} parsing.
    141 \newblock In Z.~Bellahs{\`e}ne, E.~Hunt, M.~Rys, and R.~Unland, editors, {\em
    142   Database and XML Technologies}, volume 5679 of {\em Lecture Notes in Computer
    143   Science}, pages 75--90. Springer Berlin / Heidelberg, 2009.
    144 
    145 \bibitem{Shah:2009}
    146 B.~Shah, P.~R. Rao, B.~Moon, and M.~Rajagopalan.
    147 \newblock A data parallel algorithm for xml dom parsing.
    148 \newblock In {\em Proceedings of the 6th International XML Database Symposium
    149   on Database and XML Technologies}, XSym '09, pages 75--90, Berlin,
    150   Heidelberg, 2009. Springer-Verlag.
     112\bibitem{venkatesh-asplos-2010}
     113G.~Venkatesh, J.~Sampson, N.~Goulding, S.~Garcia, V.~Bryksin, J.~Lugo-Martinez,
     114  S.~Swanson, and M.~B. Taylor.
     115\newblock Conservation cores: reducing the energy of mature computations.
     116\newblock In {\em Proceedings of the fifteenth edition of ASPLOS on
     117  Architectural support for programming languages and operating systems},
     118  ASPLOS '10, 2010.
    151119
    152120\bibitem{ZhangPanChiu09}
  • docs/HPCA2011/main.tex

    r1302 r1326  
    1 
    2 \input{preamble-final-acm}
     1%\input{preamble-final-acm}
    32%\input{preamble-tr}
    4 %\input{preamble-submit}
     3\input{preamble-submit}
    54%\usepackage{trbibtex}   % use bib-style bibliographic database
    65\usepackage{multicol}
     
    128127%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    129128% ACM title header format
    130 \title{\vspace{-30pt}Energy Efficiency and Scalability of XML Parsing Using SIMD%
      129\title{\vspace{-30pt}How to compute efficiently on commodity processors: The Parabix XML parser
     130%
    131131% \thanks{%
    132132%   This work was supported in part by NSF grants
     
    163163\date{}
    164164\begin{document}
    165 \toappear{}
    166 \maketitle
    167165\vspace{20pt}
    168166\begin{abstract}
  • docs/HPCA2011/preamble-submit.tex

    r1302 r1326  
    1 \documentclass[11pt,letterpaper]{article}
     1\documentclass[12pt,letterpaper]{article}
     2\usepackage{setspace}
     3\usepackage{latex/iccv}
    24\usepackage{fullpage}
     5\doublespacing
    36\usepackage{times,amsmath,epsfig,amssymb}
    47\usepackage{mathptmx}
     
    1619\usepackage{subfigure,graphicx}
    1720\usepackage{pifont}
    18 \usepackage{pifont}
    1921\pagestyle{plain}
    2022%\date{}
     
    2729\makeatother
    2830\usepackage{verbatim}   % for \comment environment
     31
     32\usepackage[pagebackref=true,breaklinks=true,letterpaper=true,colorlinks,bookmarks=false]{hyperref}
     33
     34
     35% \iccvfinalcopy % *** Uncomment this line for the final submission
     36
     37\def\iccvPaperID{****} % *** Enter the HPCA Paper ID here
     38\def\httilde{\mbox{\tt\raisebox{-.5ex}{\symbol{126}}}}
    2939
    3040
  • docs/HPCA2011/reference.bib

    r1325 r1326  
    480480}
    481481
    482 @inproceedings{Shah:2009,
    483  author = {Shah, Bhavik and Rao, Praveen R. and Moon, Bongki and Rajagopalan, Mohan},
    484  title = {A Data Parallel Algorithm for XML DOM Parsing},
    485  booktitle = {Proceedings of the 6th International XML Database Symposium on Database and XML Technologies},
    486  series = {XSym '09},
    487  year = {2009},
    488  isbn = {978-3-642-03554-8},
    489  location = {Lyon, France},
    490  pages = {75--90},
    491  numpages = {16},
    492  publisher = {Springer-Verlag},
    493  address = {Berlin, Heidelberg},
    494 }
    495 
    496 @article{dataparallel,
    497  author = {Wei Lu and Yinfei Pan and and Kenneth Chiu},
    498  title = {A Parallel Approach to XML Parsing},
    499  journal = {The 7th IEEE/ACM International Conference on Grid Computing},
    500  year = {2006}
    501  }
     482@inproceedings{hameed-isca-2010,
     483 author = {Hameed, Rehan and Qadeer, Wajahat and Wachs, Megan and Azizi, Omid and Solomatnikov, Alex and Lee, Benjamin C. and Richardson, Stephen and Kozyrakis, Christos and Horowitz, Mark},
     484 title = {Understanding sources of inefficiency in general-purpose chips},
     485 booktitle = {Proceedings of the 37th annual international symposium on Computer architecture},
     486 series = {ISCA '10},
     487 year = {2010}
     488}
     489
     490@inproceedings{venkatesh-asplos-2010,
     491 author = {Venkatesh, Ganesh and Sampson, Jack and Goulding, Nathan and Garcia, Saturnino and Bryksin, Vladyslav and Lugo-Martinez, Jose and Swanson, Steven and Taylor, Michael Bedford},
     492 title = {Conservation cores: reducing the energy of mature computations},
     493 booktitle = {Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems},
     494 series = {ASPLOS '10},
     495 year = {2010}
     496}
     497
     498@inproceedings{blake-isca-2010,
     499 author = {Blake, Geoffrey and Dreslinski, Ronald G. and Mudge, Trevor and Flautner, Kriszti\'{a}n},
     500 title = {Evolution of thread-level parallelism in desktop applications},
     501 booktitle = {Proceedings of the 37th annual international symposium on Computer architecture},
     502 series = {ISCA '10},
     503 year = {2010}
     504}
     505
     506@inproceedings{esmaeilzadeh-isca-2011,
     507 author = {Esmaeilzadeh, Hadi and Blem, Emily and St. Amant, Renee and Sankaralingam, Karthikeyan and Burger, Doug},
     508 title = {Dark silicon and the end of multicore scaling},
     509 booktitle = {Proceeding of the 38th annual international symposium on Computer architecture},
     510 series = {ISCA '11},
     511 year = {2011}
     512}