Changeset 3057 for docs


Ignore:
Timestamp:
Apr 19, 2013, 4:56:46 PM (6 years ago)
Author:
cameron
Message:

Linking

Location:
docs/Balisage13/Bal2013came0601
Files:
2 edited

Legend:

Unmodified
Added
Removed
  • docs/Balisage13/Bal2013came0601/Bal2013came0601.html

    r3056 r3057  
    274274</div>
    275275<div id="mast"><div class="content">
    276 <h2 class="article-title" id="idp246336"></h2>
     276<h2 class="article-title" id="idp66448"></h2>
    277277<div class="author">
    278278<h3 class="author">Nigel Medforth</h3>
     
    324324</div>
    325325<div class="mast-box">
    326 <p class="title"><a href="javascript:toggle('idp73632')" class="quiet"><img class="toc-icon" src="plus.png" alt="expand" id="icon-idp73632"></a> <span onclick="javascript:toggle('idp73632');return true">Abstract</span></p>
    327 <div class="folder" id="folder-idp73632" style="display:none"><p id="idp73936">Prior research on the acceleration of XML processing using SIMD and multi-core
     326<p class="title"><a href="javascript:toggle('idp67568')" class="quiet"><img class="toc-icon" src="plus.png" alt="expand" id="icon-idp67568"></a> <span onclick="javascript:toggle('idp67568');return true">Abstract</span></p>
     327<div class="folder" id="folder-idp67568" style="display:none"><p id="idp67872">Prior research on the acceleration of XML processing using SIMD and multi-core
    328328            parallelism has lead to a number of interesting research prototypes. This work
    329329            investigates the extent to which the techniques underlying these prototypes could result
     
    339339<p><b>Table of Contents</b></p>
    340340<dl>
    341 <dt><span class="section"><a href="#idp281488" class="toc">Introduction</a></span></dt>
    342 <dt><span class="section"><a href="#idp288352" class="toc">Background</a></span></dt>
     341<dt><span class="section"><a href="#idp275776" class="toc">Introduction</a></span></dt>
     342<dt><span class="section"><a href="#background" class="toc">Background</a></span></dt>
    343343<dd><dl>
    344 <dt><span class="section"><a href="#idp288992" class="toc">Xerces C++ Structure</a></span></dt>
    345 <dt><span class="section"><a href="#idp343824" class="toc">The Parabix Framework</a></span></dt>
    346 <dt><span class="section"><a href="#idp436192" class="toc">Sequential vs. Parallel Paradigm</a></span></dt>
     344<dt><span class="section"><a href="#background-xerces" class="toc">Xerces C++ Structure</a></span></dt>
     345<dt><span class="section"><a href="#idp343136" class="toc">The Parabix Framework</a></span></dt>
     346<dt><span class="section"><a href="#idp435712" class="toc">Sequential vs. Parallel Paradigm</a></span></dt>
    347347</dl></dd>
    348 <dt><span class="section"><a href="#idp440608" class="toc">Architecture</a></span></dt>
     348<dt><span class="section"><a href="#architecture" class="toc">Architecture</a></span></dt>
    349349<dd><dl>
    350 <dt><span class="section"><a href="#idp441280" class="toc">Overview</a></span></dt>
    351 <dt><span class="section"><a href="#idp471264" class="toc">Character Set Adapters</a></span></dt>
    352 <dt><span class="section"><a href="#idp479120" class="toc">Combined Parallel Filtering</a></span></dt>
    353 <dt><span class="section"><a href="#idp496160" class="toc">Content Stream</a></span></dt>
    354 <dt><span class="section"><a href="#idp506880" class="toc">Namespace Handling</a></span></dt>
    355 <dt><span class="section"><a href="#idp550288" class="toc">Error Handling</a></span></dt>
     350<dt><span class="section"><a href="#idp441104" class="toc">Overview</a></span></dt>
     351<dt><span class="section"><a href="#character-set-adapter" class="toc">Character Set Adapters</a></span></dt>
     352<dt><span class="section"><a href="#par-filter" class="toc">Combined Parallel Filtering</a></span></dt>
     353<dt><span class="section"><a href="#contentstream" class="toc">Content Stream</a></span></dt>
     354<dt><span class="section"><a href="#namespace-handling" class="toc">Namespace Handling</a></span></dt>
     355<dt><span class="section"><a href="#errorhandling" class="toc">Error Handling</a></span></dt>
    356356</dl></dd>
    357 <dt><span class="section"><a href="#idp560576" class="toc">Multithreading with Pipeline Parallelism</a></span></dt>
    358 <dt><span class="section"><a href="#idp582480" class="toc">Performance</a></span></dt>
     357<dt><span class="section"><a href="#multithread" class="toc">Multithreading with Pipeline Parallelism</a></span></dt>
     358<dt><span class="section"><a href="#performance" class="toc">Performance</a></span></dt>
    359359<dd><dl>
    360 <dt><span class="section"><a href="#idp585200" class="toc">Xerces C++ SAXCount</a></span></dt>
    361 <dt><span class="section"><a href="#idp609984" class="toc">GML2SVG</a></span></dt>
     360<dt><span class="section"><a href="#idp600160" class="toc">Xerces C++ SAXCount</a></span></dt>
     361<dt><span class="section"><a href="#idp626688" class="toc">GML2SVG</a></span></dt>
    362362</dl></dd>
    363 <dt><span class="section"><a href="#idp627056" class="toc">Conclusion and Future Work</a></span></dt>
     363<dt><span class="section"><a href="#conclusion" class="toc">Conclusion and Future Work</a></span></dt>
    364364</dl>
    365365</div>
    366366<div class="mast-box">
    367 <p class="title"><a href="javascript:toggle('idp75360')" class="linkbox"><img class="toc-icon" src="plus.png" alt="expand" id="icon-idp75360"></a> <span onclick="javascript:toggle('idp75360');return true">Nigel Medforth</span></p>
    368 <div class="folder" id="folder-idp75360" style="display:none">
     367<p class="title"><a href="javascript:toggle('idp69296')" class="linkbox"><img class="toc-icon" src="plus.png" alt="expand" id="icon-idp69296"></a> <span onclick="javascript:toggle('idp69296');return true">Nigel Medforth</span></p>
     368<div class="folder" id="folder-idp69296" style="display:none">
    369369<h5 class="author-email"><code class="email">&lt;<a class="email" href="mailto:nmedfort@sfu.ca">nmedfort@sfu.ca</a>&gt;</code></h5>
    370370<div class="affiliation">
     
    377377</div>
    378378<div class="personblurb">
    379 <p id="idp57680">Nigel Medforth is a M.Sc. student at Simon Fraser University and the lead
     379<p id="idp51616">Nigel Medforth is a M.Sc. student at Simon Fraser University and the lead
    380380               developer of icXML. He earned a Bachelor of Technology in Information Technology at
    381381               Kwantlen Polytechnic University in 2009 and was awarded the Dean’s Medal for
    382382               Outstanding Achievement.</p>
    383 <p id="idp58688">Nigel is currently researching ways to leverage both the Parabix framework and
     383<p id="idp52624">Nigel is currently researching ways to leverage both the Parabix framework and
    384384               stream-processing models to further accelerate XML parsing within icXML.</p>
    385385</div>
     
    387387</div>
    388388<div class="mast-box">
    389 <p class="title"><a href="javascript:toggle('idp62352')" class="linkbox"><img class="toc-icon" src="plus.png" alt="expand" id="icon-idp62352"></a> <span onclick="javascript:toggle('idp62352');return true">Dan Lin</span></p>
    390 <div class="folder" id="folder-idp62352" style="display:none">
     389<p class="title"><a href="javascript:toggle('idp56288')" class="linkbox"><img class="toc-icon" src="plus.png" alt="expand" id="icon-idp56288"></a> <span onclick="javascript:toggle('idp56288');return true">Dan Lin</span></p>
     390<div class="folder" id="folder-idp56288" style="display:none">
    391391<h5 class="author-email"><code class="email">&lt;<a class="email" href="mailto:lindanl@sfu.ca">lindanl@sfu.ca</a>&gt;</code></h5>
    392392<div class="affiliation">
     
    394394<p class="orgname">Simon Fraser University </p>
    395395</div>
    396 <div class="personblurb"><p id="idp64064">Dan Lin is a Ph.D student at Simon Fraser University. She earned a Master of Science
     396<div class="personblurb"><p id="idp58000">Dan Lin is a Ph.D student at Simon Fraser University. She earned a Master of Science
    397397             in Computing Science at Simon Fraser University in 2010. Her research focus on on high
    398398             performance algorithms that exploit parallelization strategies on various multicore platforms.
     
    401401</div>
    402402<div class="mast-box">
    403 <p class="title"><a href="javascript:toggle('idp66624')" class="linkbox"><img class="toc-icon" src="plus.png" alt="expand" id="icon-idp66624"></a> <span onclick="javascript:toggle('idp66624');return true">Kenneth Herdy</span></p>
    404 <div class="folder" id="folder-idp66624" style="display:none">
     403<p class="title"><a href="javascript:toggle('idp60560')" class="linkbox"><img class="toc-icon" src="plus.png" alt="expand" id="icon-idp60560"></a> <span onclick="javascript:toggle('idp60560');return true">Kenneth Herdy</span></p>
     404<div class="folder" id="folder-idp60560" style="display:none">
    405405<h5 class="author-email"><code class="email">&lt;<a class="email" href="mailto:ksherdy@sfu.ca">ksherdy@sfu.ca</a>&gt;</code></h5>
    406406<div class="affiliation">
     
    409409</div>
    410410<div class="personblurb">
    411 <p id="idp68352"> Ken Herdy completed an Advanced Diploma of Technology in Geographical Information
     411<p id="idp62288"> Ken Herdy completed an Advanced Diploma of Technology in Geographical Information
    412412               Systems at the British Columbia Institute of Technology in 2003 and earned a Bachelor
    413413               of Science in Computing Science with a Certificate in Spatial Information Systems at
    414414               Simon Fraser University in 2005. </p>
    415 <p id="idp268240"> Ken is currently pursuing PhD studies in Computing Science at Simon Fraser
     415<p id="idp262528"> Ken is currently pursuing PhD studies in Computing Science at Simon Fraser
    416416               University with industrial scholarship support from the Natural Sciences and
    417417               Engineering Research Council of Canada, the Mathematics of Information Technology and
     
    423423</div>
    424424<div class="mast-box">
    425 <p class="title"><a href="javascript:toggle('idp270976')" class="linkbox"><img class="toc-icon" src="plus.png" alt="expand" id="icon-idp270976"></a> <span onclick="javascript:toggle('idp270976');return true">Rob Cameron</span></p>
    426 <div class="folder" id="folder-idp270976" style="display:none">
     425<p class="title"><a href="javascript:toggle('idp265264')" class="linkbox"><img class="toc-icon" src="plus.png" alt="expand" id="icon-idp265264"></a> <span onclick="javascript:toggle('idp265264');return true">Rob Cameron</span></p>
     426<div class="folder" id="folder-idp265264" style="display:none">
    427427<h5 class="author-email"><code class="email">&lt;<a class="email" href="mailto:cameron@cs.sfu.ca">cameron@cs.sfu.ca</a>&gt;</code></h5>
    428428<div class="affiliation">
     
    434434<p class="orgname">International Characters, Inc.</p>
    435435</div>
    436 <div class="personblurb"><p id="idp272640">Dr. Rob Cameron is Professor of Computing Science and Associate Dean of Applied
     436<div class="personblurb"><p id="idp266928">Dr. Rob Cameron is Professor of Computing Science and Associate Dean of Applied
    437437               Sciences at Simon Fraser University. His research interests include programming
    438438               language and software system technology, with a specific focus on high performance
     
    450450<div id="main">
    451451<div class="article">
    452 <h2 class="article-title" id="idp246336"></h2>
    453 <div class="section" id="idp281488">
     452<h2 class="article-title" id="idp66448"></h2>
     453<div class="section" id="idp275776">
    454454<h2 class="title" style="clear: both">Introduction</h2>
    455 <p id="idp282128">   
     455<p id="idp276416">   
    456456        Parallelization and acceleration of XML parsing is a widely
    457457        studied problem that has seen the development of a number
     
    471471        hide the latency of a "job" that has to be done sequentially \cite{ICWS2008}.
    472472      </p>
    473 <p id="idp283520">
     473<p id="idp277808">
    474474        Fewer efforts have investigated SIMD parallelism, although this approach
    475475        has the potential advantage of improving single core performance as well
     
    482482        achieve further acceleration on multicore systems \cite{HPCA2012}.
    483483      </p>
    484 <p id="idp285424">
     484<p id="idp279712">
    485485        In this paper, we move beyond research prototypes to consider
    486486        the detailed integration of both SIMD and multicore parallelism into the
     
    501501        multiple cores.
    502502      </p>
    503 <p id="idp286864">
     503<p id="idp281152">
    504504        The remainder of this paper is organized as follows.   
    505         Section \ref{background} discusses the structure of the Xerces and Parabix XML parsers and the fundamental
     505          <a class="xref" href="#background" title="Background">section “Background”</a> discusses the structure of the Xerces and Parabix XML parsers and the fundamental
    506506        differences between the two parsing models.   
    507         Section \ref{architecture} then presents the icXML design based on a restructured Xerces architecture to
     507        <a class="xref" href="#architecture" title="Architecture">section “Architecture”</a> then presents the icXML design based on a restructured Xerces architecture to
    508508        incorporate SIMD parallelism using Parabix methods.   
    509         Section \ref{multithread} moves on to consider the multithreading of the icXML architecture
     509        <a class="xref" href="#multithread" title="Multithreading with Pipeline Parallelism">section “Multithreading with Pipeline Parallelism”</a> moves on to consider the multithreading of the icXML architecture
    510510        using the pipeline parallelism model. 
    511         Section \ref{performance} analyzes the performance of both the single-threaded and
     511        <a class="xref" href="#performance" title="Performance">section “Performance”</a> analyzes the performance of both the single-threaded and
    512512        multi-threaded versions of icXML in comparison to original Xerces,
    513513        demonstrating substantial end-to-end acceleration of
    514514        a GML-to-SVG translation application written against the Xerces API.
    515         Section \ref{conclusion} concludes the paper with a discussion of future work and the potential for
     515          <a class="xref" href="#conclusion" title="Conclusion and Future Work">section “Conclusion and Future Work”</a> concludes the paper with a discussion of future work and the potential for
    516516        applying the techniques discussed herein in other application domains.
    517517      </p>
    518518</div>
    519 <div class="section" id="idp288352">
     519<div class="section" id="background">
    520520<h2 class="title" style="clear: both">Background</h2>
    521 <div class="section" id="idp288992">
     521<div class="section" id="background-xerces">
    522522<h3 class="title" style="clear: both">Xerces C++ Structure</h3>
    523 <p id="idp289632"> The Xerces C++ parser is a widely-used standards-conformant
     523<p id="idp300432"> The Xerces C++ parser is a widely-used standards-conformant
    524524            XML parser produced as open-source software
    525525             by the Apache Software Foundation.
     
    532532            parsing using either pull parsing or SAX/SAX2 push-style parsing as well as a DOM
    533533            tree-based parsing interface. </p>
    534 <p id="idp290896">
     534<p id="idp301696">
    535535            Xerces,
    536536            like all traditional parsers, processes XML documents sequentially a byte-at-a-time from
     
    551551<div class="table-wrapper" id="xerces-profile">
    552552<p class="title">Table I</p>
    553 <div class="caption"><p id="idp9104">Execution Time of Top 10 Xerces Functions</p></div>
     553<div class="caption"><p id="idm824832">Execution Time of Top 10 Xerces Functions</p></div>
    554554<table class="table" xml:id="xerces-profile">
    555555<colgroup span="1">
     
    606606</div>
    607607</div>
    608 <div class="section" id="idp343824">
     608<div class="section" id="idp343136">
    609609<h3 class="title" style="clear: both">The Parabix Framework</h3>
    610 <p id="idp344496"> The Parabix (parallel bit stream) framework is a transformative approach to XML
     610<p id="idp343808"> The Parabix (parallel bit stream) framework is a transformative approach to XML
    611611            parsing (and other forms of text processing.) The key idea is to exploit the
    612612            availability of wide SIMD registers (e.g., 128-bit) in commodity processors to represent
     
    638638<div class="table-wrapper" id="xml-bytes">
    639639<p class="title">Table II</p>
    640 <div class="caption"><p id="idp358528">XML Source Data</p></div>
     640<div class="caption"><p id="idp358432">XML Source Data</p></div>
    641641<table class="table" xml:id="xml-bytes">
    642642<colgroup span="1">
     
    667667<div class="table-wrapper" id="xml-bits">
    668668<p class="title">Table III</p>
    669 <div class="caption"><p id="idp374416">8-bit ASCII Basis Bit Streams</p></div>
     669<div class="caption"><p id="idp374704">8-bit ASCII Basis Bit Streams</p></div>
    670670<table class="table" xml:id="xml-bits">
    671671<colgroup span="1">
     
    734734</table>
    735735</div>
    736 <p id="idp414544"> Consider, for example, the XML source data stream shown in the first line of <a class="xref" href="#derived">Table IV</a>.
     736<p id="idp414416"> Consider, for example, the XML source data stream shown in the first line of <a class="xref" href="#derived">Table IV</a>.
    737737The remaining lines of this figure show
    738738            several parallel bit streams that are computed in Parabix-style parsing, with each bit
     
    748748<div class="table-wrapper" id="derived">
    749749<p class="title">Table IV</p>
    750 <div class="caption"><p id="idp418912">XML Source Data and Derived Parallel Bit Streams</p></div>
     750<div class="caption"><p id="idp418560">XML Source Data and Derived Parallel Bit Streams</p></div>
    751751<table class="table" xml:id="derived">
    752752<colgroup span="1">
     
    798798</table>
    799799</div>
    800 <p id="idp431968"> Two intuitions may help explain how the Parabix approach can lead to improved XML
     800<p id="idp431488"> Two intuitions may help explain how the Parabix approach can lead to improved XML
    801801            parsing performance. The first is that the use of the full register width offers a
    802802            considerable information advantage over sequential byte-at-a-time parsing. That is,
     
    807807            individual decision-bits, an approach that computes many of them in parallel (e.g., 128
    808808            bytes at a time using 128-bit registers) should provide substantial benefit. </p>
    809 <p id="idp433216"> Previous studies have shown that the Parabix approach improves many aspects of XML
     809<p id="idp432736"> Previous studies have shown that the Parabix approach improves many aspects of XML
    810810            processing, including transcoding \cite{Cameron2008}, character classification and
    811811            validation, tag parsing and well-formedness checking. The first Parabix parser used
     
    816816            \cite{HPCA2012}. Although these research prototypes handled the full syntax of
    817817            schema-less XML documents, they lacked the functionality required by full XML parsers. </p>
    818 <p id="idp435344"> Commercial XML processors support transcoding of multiple character sets and can
     818<p id="idp434864"> Commercial XML processors support transcoding of multiple character sets and can
    819819            parse and validate against multiple document vocabularies. Additionally, they provide
    820820            API facilities beyond those found in research prototypes, including the widely used SAX,
    821821            SAX2 and DOM interfaces. </p>
    822822</div>
    823 <div class="section" id="idp436192">
     823<div class="section" id="idp435712">
    824824<h3 class="title" style="clear: both">Sequential vs. Parallel Paradigm</h3>
    825 <p id="idp436832"> Xerces—like all traditional XML parsers—processes XML documents
     825<p id="idp436400"> Xerces—like all traditional XML parsers—processes XML documents
    826826            sequentially. Each character is examined to distinguish between the XML-specific markup,
    827827            such as a left angle bracket <code class="code">&lt;</code>, and the content held within the
    828828            document. As the parser progresses through the document, it alternates between markup
    829829            scanning, validation and content processing modes. </p>
    830 <p id="idp438400"> In other words, Xerces belongs to an equivalent class applications termed FSM
     830<p id="idp437968"> In other words, Xerces belongs to an equivalent class applications termed FSM
    831831            applications\footnote{ Herein FSM applications are considered software systems whose
    832832            behaviour is defined by the inputs, current state and the events associated with
     
    834834            subsequent characters. Unfortunately, textual data tends to be unpredictable and any
    835835            character could induce a state transition. </p>
    836 <p id="idp439312"> Parabix-style XML parsers utilize a concept of layered processing. A block of source
     836<p id="idp438880"> Parabix-style XML parsers utilize a concept of layered processing. A block of source
    837837            text is transformed into a set of lexical bitstreams, which undergo a series of
    838838            operations that can be grouped into logical layers, e.g., transposition, character
     
    843843</div>
    844844</div>
    845 <div class="section" id="idp440608">
     845<div class="section" id="architecture">
    846846<h2 class="title" style="clear: both">Architecture</h2>
    847 <div class="section" id="idp441280">
     847<div class="section" id="idp441104">
    848848<h3 class="title" style="clear: both">Overview</h3>
    849 <p id="idp442336"> icXML is more than an optimized version of Xerces. Many components were grouped,
     849<p id="idp442160"> icXML is more than an optimized version of Xerces. Many components were grouped,
    850850            restructured and rearchitected with pipeline parallelism in mind. In this section, we
    851851            highlight the core differences between the two systems. As shown in Figure
    852             \ref{fig:xerces-arch}, Xerces is comprised of five main modules: the transcoder, reader,
     852              <a class="xref" href="#xerces-arch" title="Xerces Architecture">Figure 1</a>, Xerces is comprised of five main modules: the transcoder, reader,
    853853            scanner, namespace binder, and validator. The <span class="ital">Transcoder</span> converts source data into UTF-16 before Xerces parses it as XML;
    854854            the majority of the character set encoding validation is performed as a byproduct of
     
    873873<p class="title">Figure 1: Xerces Architecture</p>
    874874<div class="figure-contents">
    875 <div class="mediaobject" id="idp450016"><img alt="png image (xerces.png)" src="xerces.png" width="150cm"></div>
     875<div class="mediaobject" id="idp450128"><img alt="png image (xerces.png)" src="xerces.png" width="150cm"></div>
    876876<div class="caption"></div>
    877877</div>
    878878</div>
    879 <p id="idp452336"> In icXML functions are grouped into logical components. As shown in Figure
     879<p id="idp452496"> In icXML functions are grouped into logical components. As shown in
    880880             <a class="xref" href="#xerces-arch" title="Xerces Architecture">Figure 1</a>, two major categories exist: (1) the Parabix Subsystem and (2) the
    881881            Markup Processor. All tasks in (1) use the Parabix Framework \cite{HPCA2012}, which
    882882            represents data as a set of parallel bitstreams. The <span class="ital">Character Set
    883                Adapter</span>, discussed in Section \ref{arch:character-set-adapter}, mirrors
     883              Adapter</span>, discussed in <a class="xref" href="#character-set-adapter" title="Character Set Adapters">section “Character Set Adapters”</a>, mirrors
    884884            Xerces's Transcoder duties; however instead of producing UTF-16 it produces a set of
    885             lexical bitstreams, similar to those shown in Figure \ref{fig:parabix1}. These lexical
     885              lexical bitstreams, similar to those shown in . These lexical
    886886            bitstreams are later transformed into UTF-16 in the Content Stream Generator, after
    887887            additional processing is performed. The first precursor to producing UTF-16 is the
     
    894894            icXML must provide the Line and Column position of each error. The <span class="ital">Line-Column Tracker</span> uses the lexical information to keep track of the
    895895            document position(s) through the use of an optimized population count algorithm,
    896             described in Section \ref{section:arch:errorhandling}. From here, two data-independent
     896              described in <a class="xref" href="#errorhandling" title="Error Handling">section “Error Handling”</a>. From here, two data-independent
    897897            branches exist: the Symbol Resolver and Content Preparation Unit. </p>
    898 <p id="idp457328"> A typical XML file contains few unique element and attribute names—but
     898<p id="idp459328"> A typical XML file contains few unique element and attribute names—but
    899899            each of them will occur frequently. icXML stores these as distinct data structures,
    900900            called symbols, each with their own global identifier (GID). Using the symbol marker
     
    902902               Resolver</span> scans through the raw data to produce a sequence of GIDs, called
    903903            the <span class="ital">symbol stream</span>. </p>
    904 <p id="idp459968"> The final components of the Parabix Subsystem are the <span class="ital">Content
     904<p id="idp461984"> The final components of the Parabix Subsystem are the <span class="ital">Content
    905905               Preparation Unit</span> and <span class="ital">Content Stream
    906906            Generator</span>. The former takes the (transposed) basis bitstreams and selectively
    907907            filters them, according to the information provided by the Parallel Markup Parser, and
    908             the latter transforms the filtered streams into the tagged UTF-16 <span class="ital">content stream</span>, discussed in Section \ref{section:arch:contentstream}. </p>
    909 <p id="idp462880"> Combined, the symbol and content stream form icXML's compressed IR of the XML
     908            the latter transforms the filtered streams into the tagged UTF-16 <span class="ital">content stream</span>, discussed in <a class="xref" href="#contentstream" title="Content Stream">section “Content Stream”</a>. </p>
     909<p id="idp465584"> Combined, the symbol and content stream form icXML's compressed IR of the XML
    910910            document. The <span class="ital">Markup Processor</span>~parses the IR to
    911911            validate and produce the sequential output for the end user. The <span class="ital">Final WF checker</span> performs inter-element well-formedness validation that
     
    914914            that produces a series of URI identifiers (URI IDs), the <span class="ital">URI
    915915               stream</span>, which are associated with each symbol occurrence. This is
    916             discussed in Section \ref{section:arch:namespacehandling}. Finally, the <span class="ital">Validation</span> layer implements the Xerces's validator. However,
     916                 discussed in <a class="xref" href="#namespace-handling" title="Namespace Handling">section “Namespace Handling”</a>. Finally, the <span class="ital">Validation</span> layer implements the Xerces's validator. However,
    917917            preprocessing associated with each symbol greatly reduces the work of this stage. </p>
    918918<div class="figure" id="icxml-arch">
    919919<p class="title">Figure 2: icXML Architecture</p>
    920920<div class="figure-contents">
    921 <div class="mediaobject" id="idp468816"><img alt="png image (icxml.png)" src="icxml.png" width="500cm"></div>
     921<div class="mediaobject" id="idp472048"><img alt="png image (icxml.png)" src="icxml.png" width="500cm"></div>
    922922<div class="caption"></div>
    923923</div>
    924924</div>
    925925</div>
    926 <div class="section" id="idp471264">
     926<div class="section" id="character-set-adapter">
    927927<h3 class="title" style="clear: both">Character Set Adapters</h3>
    928 <p id="idp471936"> In Xerces, all input is transcoded into UTF-16 to simplify the parsing costs of
     928<p id="idp475536"> In Xerces, all input is transcoded into UTF-16 to simplify the parsing costs of
    929929            Xerces itself and provide the end-consumer with a single encoding format. In the
    930930            important case of UTF-8 to UTF-16 transcoding, the transcoding costs can be significant,
     
    933933            other cases, transcoding may involve table look-up operations for each byte of input. In
    934934            any case, transcoding imposes at least a cost of buffer copying. </p>
    935 <p id="idp472992"> In icXML, however, the concept of Character Set Adapters (CSAs) is used to minimize
     935<p id="idp476592"> In icXML, however, the concept of Character Set Adapters (CSAs) is used to minimize
    936936            transcoding costs. Given a specified input encoding, a CSA is responsible for checking
    937937            that input code units represent valid characters, mapping the characters of the encoding
     
    939939            item streams), as well as supporting ultimate transcoding requirements. All of this work
    940940            is performed using the parallel bitstream representation of the source input. </p>
    941 <p id="idp473968"> An important observation is that many character sets are an extension to the legacy
     941<p id="idp477568"> An important observation is that many character sets are an extension to the legacy
    942942            7-bit ASCII character set. This includes the various ISO Latin character sets, UTF-8,
    943943            UTF-16 and many others. Furthermore, all significant characters for parsing XML are
    944944            confined to the ASCII repertoire. Thus, a single common set of lexical item calculations
    945945            serves to compute lexical item streams for all such ASCII-based character sets. </p>
    946 <p id="idp474848"> A second observation is that—regardless of which character set is
     946<p id="idp478448"> A second observation is that—regardless of which character set is
    947947            used—quite often all of the characters in a particular block of input will be
    948948            within the ASCII range. This is a very simple test to perform using the bitstream
     
    951951            be skipped. Transcoding to UTF-16 becomes trivial as the high eight bitstreams of the
    952952            UTF-16 form are each set to zero in this case. </p>
    953 <p id="idp476640"> A third observation is that repeated transcoding of the names of XML elements,
     953<p id="idp480160"> A third observation is that repeated transcoding of the names of XML elements,
    954954            attributes and so on can be avoided by using a look-up mechanism. That is, the first
    955955            occurrence of each symbol is stored in a look-up table mapping the input encoding to a
     
    958958            symbol look up is required to apply various XML validation rules, there is achieves the
    959959            effect of transcoding each occurrence without additional cost. </p>
    960 <p id="idp477696"> The cost of individual character transcoding is avoided whenever a block of input is
     960<p id="idp481216"> The cost of individual character transcoding is avoided whenever a block of input is
    961961            confined to the ASCII subset and for all but the first occurrence of any XML element or
    962962            attribute name. Furthermore, when transcoding is required, the parallel bitstream
     
    969969            using bit scan operations. </p>
    970970</div>
    971 <div class="section" id="idp479120">
     971<div class="section" id="par-filter">
    972972<h3 class="title" style="clear: both">Combined Parallel Filtering</h3>
    973 <p id="idp479808"> As just mentioned, UTF-8 to UTF-16 transcoding involves marking all but the last
     973<p id="idp483728"> As just mentioned, UTF-8 to UTF-16 transcoding involves marking all but the last
    974974            bytes of multi-byte UTF-8 sequences as positions for deletion. For example, the two
    975975            Chinese characters <code class="code">䜠奜</code> are represented as two
     
    985985            may then be completed by applying parallel deletion and inverse transposition of the
    986986            UTF-16 bitstreams\cite{Cameron2008}. </p>
    987 <p id="idp483968"> Rather than immediately paying the costs of deletion and transposition just for
     987<p id="idp487856"> Rather than immediately paying the costs of deletion and transposition just for
    988988            transcoding, however, icXML defers these steps so that the deletion masks for several
    989989            stages of processing may be combined. In particular, this includes core XML requirements
     
    10001000<div class="figure-contents">
    10011001<div class="caption">Line Break Normalization Logic</div>
    1002 <pre class="programlisting" id="idp487952">
     1002<pre class="programlisting" id="idp491888">
    10031003# XML 1.0 line-break normalization rules.
    10041004if lex.CR:
     
    10161016</div>
    10171017         </p>
    1018 <p id="idp489424"> In essence, the deletion masks for transcoding and for line break normalization each
     1018<p id="idp493360"> In essence, the deletion masks for transcoding and for line break normalization each
    10191019            represent a bitwise filter; these filters can be combined using bitwise-or so that the
    10201020            parallel deletion algorithm need only be applied once. </p>
    1021 <p id="idp490080"> A further application of combined filtering is the processing of XML character and
     1021<p id="idp494016"> A further application of combined filtering is the processing of XML character and
    10221022            entity references. Consider, for example, the references <code class="code">&amp;</code> or
    10231023               <code class="code">&lt;</code>. which must be replaced in XML processing with the single
     
    10321032            UTF-16 code unit. In the case, that this is not true, it is addressed in
    10331033            post-processing. </p>
    1034 <p id="idp494832"> The final step of combined filtering occurs during the process of reducing markup
     1034<p id="idp498768"> The final step of combined filtering occurs during the process of reducing markup
    10351035            data to tag bytes preceding each significant XML transition as described in
    1036             section~\ref{section:arch:contentstream}. Overall, icXML avoids separate buffer copying
     1036              <a class="xref" href="#contentstream" title="Content Stream">section “Content Stream”</a>. Overall, icXML avoids separate buffer copying
    10371037            operations for each of the these filtering steps, paying the cost of parallel deletion
    10381038            and inverse transposition only once. Currently, icXML employs the parallel-prefix
     
    10421042            Haswell architecture. </p>
    10431043</div>
    1044 <div class="section" id="idp496160">
     1044<div class="section" id="contentstream">
    10451045<h3 class="title" style="clear: both">Content Stream</h3>
    1046 <p id="idp496832"> A relatively-unique concept for icXML is the use of a filtered content stream.
     1046<p id="idp501744"> A relatively-unique concept for icXML is the use of a filtered content stream.
    10471047            Rather that parsing an XML document in its original format, the input is transformed
    10481048            into one that is easier for the parser to iterate through and produce the sequential
     
    10511051            is transformed into
    10521052           
    1053             through the parallel filtering algorithm, described in section \ref{sec:parfilter}. </p>
    1054 <p id="idp499344"> Combined with the symbol stream, the parser traverses the content stream to
     1053            through the parallel filtering algorithm, described in <a class="xref" href="#par-filter" title="Combined Parallel Filtering">section “Combined Parallel Filtering”</a>. </p>
     1054<p id="idp504928"> Combined with the symbol stream, the parser traverses the content stream to
    10551055            effectively reconstructs the input document in its output form. The initial <span class="ital">0</span> indicates an empty content string. The following
    10561056               <code class="code">&gt;</code> indicates that a start tag without any attributes is the first
     
    10591059            in accordance with the Xerces API specification. Unlike Xerces, no memory-copy
    10601060            operations are required to produce these strings, which as
    1061             Figure~\ref{fig:xerces-profile} shows accounts for 6.83% of Xerces's execution time.
     1061              <a class="xref" href="#xerces-profile">Table I</a> shows accounts for 6.83% of Xerces's execution time.
    10621062            Additionally, it is cheap to locate the terminal character of each string: using the
    10631063            String End bitstream, the Parabix Subsystem can effectively calculate the offset of each
    10641064            null character in the content stream in parallel, which in turn means the parser can
    10651065            directly jump to the end of every string without scanning for it. </p>
    1066 <p id="idp502736"> Following <code class="code">'fee'</code> is a <code class="code">=</code>, which marks the
     1066<p id="idp509008"> Following <code class="code">'fee'</code> is a <code class="code">=</code>, which marks the
    10671067            existence of an attribute. Because all of the intra-element was performed in the Parabix
    10681068            Subsystem, this must be a legal attribute. Since attributes can only occur within start
     
    10781078            that the appropriate scope-nesting rules have been applied. </p>
    10791079</div>
    1080 <div class="section" id="idp506880">
     1080<div class="section" id="namespace-handling">
    10811081<h3 class="title" style="clear: both">Namespace Handling</h3>
    1082 <p id="idp507968"> In XML, namespaces prevents naming conflicts when multiple vocabularies are used
     1082<p id="idp514576"> In XML, namespaces prevents naming conflicts when multiple vocabularies are used
    10831083            together. It is especially important when a vocabulary application-dependant meaning,
    10841084            such as when XML or SVG documents are embedded within XHTML files. Namespaces are bound
    10851085            to uniform resource identifiers (URIs), which are strings used to identify specific
    1086             names or resources. On line 1 in the Table below, the <code class="code">xmlns</code>
     1086            names or resources. On line 1 in <a class="xref" href="#namespace-ex">Table V</a>, the <code class="code">xmlns</code>
    10871087            attribute instructs the XML processor to bind the prefix <code class="code">p</code> to the URI
    10881088               '<code class="code">pub.net</code>' and the default (empty) prefix to
     
    10971097            uniquely-named items because the current vocabulary is determined by the namespace(s)
    10981098            that are in-scope. </p>
    1099 <div class="table-wrapper" id="idp515088">
     1099<div class="table-wrapper" id="namespace-ex">
    11001100<p class="title">Table V</p>
    1101 <div class="caption"><p id="idp515600">XML Namespace Example</p></div>
    1102 <table class="table">
     1101<div class="caption"><p id="idp523360">XML Namespace Example</p></div>
     1102<table class="table" xml:id="namespace-ex">
    11031103<colgroup span="1">
    11041104<col align="centre" valign="top" span="1">
     
    11331133</table>
    11341134</div>
    1135 <p id="idp524576"> In both Xerces and icXML, every URI has a one-to-one mapping to a URI ID. These
     1135<p id="idp532304"> In both Xerces and icXML, every URI has a one-to-one mapping to a URI ID. These
    11361136            persist for the lifetime of the application through the use of a global URI pool. Xerces
    11371137            maintains a stack of namespace scopes that is pushed (popped) every time a start tag
     
    11411141            those that declare a set of namespaces upfront and never change them, and (2) those that
    11421142            repeatedly modify the namespaces in predictable patterns. </p>
    1143 <p id="idp525712"> For that reason, icXML contains an independent namespace stack and utilizes bit
     1143<p id="idp533440"> For that reason, icXML contains an independent namespace stack and utilizes bit
    11441144            vectors to cheaply perform
    11451145             When a prefix is
     
    11481148            process) to the URI. Each unique namespace binding has a unique namespace id (NSID) and
    11491149            every prefix contains a bit vector marking every NSID that has ever been associated with
    1150             it within the document. For example, in Table \ref{tbl:namespace1}, the prefix binding
     1150              it within the document. For example, in <a class="xref" href="#namespace-ex">Table V</a>, the prefix binding
    11511151            set of <code class="code">p</code> and <code class="code">xmlns</code> would be <code class="code">01</code> and
    11521152            <code class="code">11</code> respectively. To resolve the in-scope namespace binding for each prefix,
    11531153            a bit vector of the currently visible namespaces is maintained by the system. By ANDing
    11541154            the prefix bit vector with the currently visible namespaces, the in-scope NSID can be
    1155             found using a bit-scan intrinsic. A namespace binding table, similar to Table
    1156             \ref{tbl:namespace1}, provides the actual URI ID. </p>
    1157 <div class="table-wrapper" id="idp530224">
     1155            found using a bit-scan intrinsic. A namespace binding table, similar to
     1156            <a class="xref" href="#namespace-binding">Table VI</a>, provides the actual URI ID. </p>
     1157<div class="table-wrapper" id="namespace-binding">
    11581158<p class="title">Table VI</p>
    1159 <div class="caption"><p id="idp530736">Namespace Binding Table Example</p></div>
    1160 <table class="table">
     1159<div class="caption"><p id="idp540032">Namespace Binding Table Example</p></div>
     1160<table class="table" xml:id="namespace-binding">
    11611161<colgroup span="1">
    11621162<col align="centre" valign="top" span="1">
     
    11981198</table>
    11991199</div>
    1200 <p id="idp546928">
     1200<p id="idp556448">
    12011201           
    12021202           
     
    12041204           
    12051205         </p>
    1206 <p id="idp548832"> To ensure that scoping rules are adhered to, whenever a start tag is encountered,
     1206<p id="idp558352"> To ensure that scoping rules are adhered to, whenever a start tag is encountered,
    12071207            any modification to the currently visible namespaces is calculated and stored within a
    12081208            stack of bit vectors denoting the locally modified namespace bindings. When an end tag
     
    12131213         </p>
    12141214</div>
    1215 <div class="section" id="idp550288">
     1215<div class="section" id="errorhandling">
    12161216<h3 class="title" style="clear: both">Error Handling</h3>
    1217 <p id="idp550960">
     1217<p id="idp560784">
    12181218           
    12191219            Xerces outputs error messages in two ways: through the programmer API and as thrown
     
    12221222            type and severity of the error. icXML emits errors in the similar manner—but
    12231223            how it discovers them is substantially different. Recall that in Figure
    1224             \ref{fig:icxml-arch}, icXML is divided into two sections: the Parabix Subsystem and
     1224            <a class="xref" href="#icxml-arch" title="icXML Architecture">Figure 2</a>, icXML is divided into two sections: the Parabix Subsystem and
    12251225            Markup Processor, each with its own system for detecting and producing error messages. </p>
    1226 <p id="idp552592"> Within the Parabix Subsystem, all computations are performed in parallel, a block at
     1226<p id="idp563280"> Within the Parabix Subsystem, all computations are performed in parallel, a block at
    12271227            a time. Errors are derived as artifacts of bitstream calculations, with a 1-bit marking
    12281228            the byte-position of an error within a block, and the type of error is determined by the
     
    12571257            detected, the sum of those skipped positions is subtracted from the distance to
    12581258            determine the actual column number. </p>
    1259 <p id="idp558080"> The Markup Processor is a state-driven machine. As such, error detection within it
     1259<p id="idp568800"> The Markup Processor is a state-driven machine. As such, error detection within it
    12601260            is very similar to Xerces. However, reporting the correct line/column is a much more
    12611261            difficult problem. The Markup Processor parses the content stream, which is a series of
     
    12711271</div>
    12721272</div>
    1273 <div class="section" id="idp560576">
     1273<div class="section" id="multithread">
    12741274<h2 class="title" style="clear: both">Multithreading with Pipeline Parallelism</h2>
    1275 <p id="idp561216"> As discussed in section \ref{background:xerces}, Xerces can be considered a FSM
     1275<p id="idp572336"> As discussed in section <a class="xref" href="#background-xerces" title="Xerces C++ Structure">section “Xerces C++ Structure”</a>, Xerces can be considered a FSM
    12761276         application. These are "embarrassingly
    12771277         sequential."\cite{Asanovic:EECS-2006-183} and notoriously difficult to
     
    12811281         well into the general model of pipeline parallelism, in which each thread is in charge of a
    12821282         single module or group of modules. </p>
    1283 <p id="idp563072"> The most straightforward division of work in icXML is to separate the Parabix Subsystem
     1283<p id="idp574752"> The most straightforward division of work in icXML is to separate the Parabix Subsystem
    12841284         and the Markup Processor into distinct logical layers into two separate stages. The
    12851285         resultant application, <span class="ital">icXML-p</span>, is a course-grained
     
    12911291         and grammar-based validation, and the provides parsed XML data to the application through
    12921292         the Xerces API. The shared data structure is implemented using a ring buffer, where every
    1293          entry contains an independent set of data streams. In the examples of Figure
    1294          \ref{threads_timeline1} and \ref{threads_timeline2}, the ring buffer has four entries. A
     1293         entry contains an independent set of data streams. In the examples of
     1294           <a class="xref" href="#threads_timeline1" title="Thread Balance in Two-Stage Pipelines: Stage 1 Dominant">Figure 4</a>, the ring buffer has four entries. A
    12951295         lock-free mechanism is applied to ensure that each entry can only be read or written by one
    1296          thread at the same time. In Figure \ref{threads_timeline1} the processing time of
     1296         thread at the same time. In  <a class="xref" href="#threads_timeline1" title="Thread Balance in Two-Stage Pipelines: Stage 1 Dominant">Figure 4</a> the processing time of
    12971297               <code class="code">T<sub>1</sub></code> is longer than
    12981298         <code class="code">T<sub>2</sub></code>; thus <code class="code">T<sub>2</sub></code> always
    1299          waits for <code class="code">T<sub>1</sub></code> to write to the shared memory. Figure
    1300          \ref{threads_timeline2} illustrates the scenario in which
     1299         waits for <code class="code">T<sub>1</sub></code> to write to the shared memory.  
     1300         <a class="xref" href="#threads_timeline2" title="Thread Balance in Two-Stage Pipelines: Stage 2 Dominant">Figure 5</a> illustrates the scenario in which
    13011301         <code class="code">T<sub>1</sub></code> is faster and must wait for
    13021302            <code class="code">T<sub>2</sub></code> to finish reading the shared data before it can
    13031303         reuse the memory space. </p>
    1304 <p id="idp572752">
     1304<p id="idp585824">
    13051305        <div class="figure" id="threads_timeline1">
    1306 <p class="title">Figure 4: Thread Balance in Two-Stage Pipelines</p>
    1307 <div class="figure-contents">
    1308 <div class="mediaobject" id="idp574096"><img alt="png image (threads_timeline1.png)" src="threads_timeline1.png" width="500cm"></div>
    1309 <div class="mediaobject" id="idp575872"><img alt="png image (threads_timeline2.png)" src="threads_timeline2.png" width="500cm"></div>
    1310 <div class="caption"></div>
    1311 </div>
     1306<p class="title">Figure 4: Thread Balance in Two-Stage Pipelines: Stage 1 Dominant</p>
     1307<div class="figure-contents"><div class="mediaobject" id="idp587152"><img alt="png image (threads_timeline1.png)" src="threads_timeline1.png" width="500cm"></div></div>
     1308</div>
     1309        <div class="figure" id="threads_timeline2">
     1310<p class="title">Figure 5: Thread Balance in Two-Stage Pipelines: Stage 2 Dominant</p>
     1311<div class="figure-contents"><div class="mediaobject" id="idp590160"><img alt="png image (threads_timeline2.png)" src="threads_timeline2.png" width="500cm"></div></div>
    13121312</div>
    13131313      </p>
    1314 <p id="idp578320"> Overall, our design is intended to benefit a range of applications. Conceptually, we
     1314<p id="idp592192"> Overall, our design is intended to benefit a range of applications. Conceptually, we
    13151315         consider two design points. The first, the parsing performed by the Parabix Subsystem
    13161316         dominates at 67% of the overall cost, with the cost of application processing (including
     
    13181318         scenario, the cost of application processing dominates at 60%, while the cost of XML
    13191319         parsing represents an overhead of 40%. </p>
    1320 <p id="idp579232"> Our design is predicated on a goal of using the Parabix framework to achieve a 50% to
     1320<p id="idp593104"> Our design is predicated on a goal of using the Parabix framework to achieve a 50% to
    13211321         100% improvement in the parsing engine itself. In a best case scenario, a 100% improvement
    13221322         of the Parabix Subsystem for the design point in which XML parsing dominates at 67% of the
     
    13261326         about 33% of the original work. In this case, Amdahl's law predicts that we could expect up
    13271327         to a 3x speedup at best. </p>
    1328 <p id="idp580352"> At the other extreme of our design range, we consider an application in which core
     1328<p id="idp594224"> At the other extreme of our design range, we consider an application in which core
    13291329         parsing cost is 40%. Assuming the 2x speedup of the Parabix Subsystem over the
    13301330         corresponding Xerces core, single-threaded icXML delivers a 25% speedup. However, the most
     
    13321332         the entire latency of parsing within the serial time required by the application. In this
    13331333         case, we achieve an overall speedup in processing time by 1.67x. </p>
    1334 <p id="idp581296"> Although the structure of the Parabix Subsystem allows division of the work into
     1334<p id="idp595168"> Although the structure of the Parabix Subsystem allows division of the work into
    13351335         several pipeline stages and has been demonstrated to be effective for four pipeline stages
    13361336         in a research prototype \cite{HPCA2012}, our analysis here suggests that the further
     
    13401340         the cost of application logic that could match reductions in core parsing cost. </p>
    13411341</div>
    1342 <div class="section" id="idp582480">
     1342<div class="section" id="performance">
    13431343<h2 class="title" style="clear: both">Performance</h2>
    1344 <p id="idp583152"> We evaluate Xerces-C++ 3.1.1, icXML, icXML-p against two benchmarking applications: the
     1344<p id="idp598112"> We evaluate Xerces-C++ 3.1.1, icXML, icXML-p against two benchmarking applications: the
    13451345         Xerces C++ SAXCount sample application, and a real world GML to SVG transformation
    13461346         application. We investigated XML parser performance using an Intel Core i7 quad-core (Sandy
     
    13481348         L1 cache, 256 kB (per core) L2 cache, 8 MB L3 cache) running the 64-bit version of Ubuntu
    13491349         12.04 (Linux). </p>
    1350 <p id="idp584064"> We analyzed the execution profiles of each XML parser using the performance counters
     1350<p id="idp599024"> We analyzed the execution profiles of each XML parser using the performance counters
    13511351         found in the processor. We chose several key hardware events that provide insight into the
    13521352         profile of each application and indicate if the processor is doing useful work. The set of
     
    13561356         collection of hardware performance monitoring statistics. In addition, we used the Linux
    13571357         perf \cite{perf} utility to collect per core hardware events. </p>
    1358 <div class="section" id="idp585200">
     1358<div class="section" id="idp600160">
    13591359<h3 class="title" style="clear: both">Xerces C++ SAXCount</h3>
    1360 <p id="idp585872"> Xerces comes with sample applications that demonstrate salient features of the
     1360<p id="idp600832"> Xerces comes with sample applications that demonstrate salient features of the
    13611361            parser. SAXCount is the simplest such application: it counts the elements, attributes
    13621362            and characters of a given XML file using the (event based) SAX API and prints out the
    13631363            totals. </p>
    1364 <p id="idp586576"> Table \ref{XMLDocChars} shows the document characteristics of the XML input files
     1364<p id="idp601536"> <a class="xref" href="#XMLdocs">Table VII</a> shows the document characteristics of the XML input files
    13651365            selected for the Xerces C++ SAXCount benchmark. The jaw.xml represents document-oriented
    13661366            XML inputs and contains the three-byte and four-byte UTF-8 sequence required for the
    13671367            UTF-8 encoding of Japanese characters. The remaining data files are data-oriented XML
    13681368            documents and consist entirely of single byte encoded ASCII characters.
    1369   <div class="table-wrapper" id="idp587312">
     1369  <div class="table-wrapper" id="XMLdocs">
    13701370<p class="title">Table VII</p>
    1371 <div class="caption"><p id="idp587824">XML Document Characteristics</p></div>
    1372 <table class="table">
     1371<div class="caption"><p id="idp603872">XML Document Characteristics</p></div>
     1372<table class="table" xml:id="XMLdocs">
    13731373<colgroup span="1">
    13741374<col align="left" valign="top" span="1">
     
    14181418</div>           
    14191419</p>
    1420 <p id="idp603408"> A key predictor of the overall parsing performance of an XML file is markup
     1420<p id="idp619472"> A key predictor of the overall parsing performance of an XML file is markup
    14211421            density\footnote{ Markup Density: the ratio of markup bytes used to define the structure
    14221422            of the document vs. its file size.}. This metric has substantial influence on the
     
    14251425            of document-oriented and data-oriented XML files to analyze performance over a spectrum
    14261426            of markup densities. </p>
    1427 <p id="idp604416"> Figure \ref{perf_SAX} compares the performance of Xerces, icXML and pipelined icXML
     1427<p id="idp621088"> <a class="xref" href="#perf_SAX" title="SAXCount Performance Comparison">Figure 6</a> compares the performance of Xerces, icXML and pipelined icXML
    14281428            in terms of CPU cycles per byte for the SAXCount application. The speedup for icXML over
    14291429            Xerces is 1.3x to 1.8x. With two threads on the multicore machine, icXML-p can achieve
     
    14321432            icXML-p performs better as markup-density increases because the work performed by each
    14331433            stage is well balanced in this application. </p>
    1434 <p id="idp606096">
     1434<p id="idp622832">
    14351435        <div class="figure" id="perf_SAX">
    1436 <p class="title">Figure 5: SAXCount Performance Comparison</p>
     1436<p class="title">Figure 6: SAXCount Performance Comparison</p>
    14371437<div class="figure-contents">
    1438 <div class="mediaobject" id="idp607440"><img alt="png image (perf_SAX.png)" src="perf_SAX.png" width="500cm"></div>
     1438<div class="mediaobject" id="idp624144"><img alt="png image (perf_SAX.png)" src="perf_SAX.png" width="500cm"></div>
    14391439<div class="caption"></div>
    14401440</div>
     
    14421442         </p>
    14431443</div>
    1444 <div class="section" id="idp609984">
     1444<div class="section" id="idp626688">
    14451445<h3 class="title" style="clear: both">GML2SVG</h3>
    1446 <p id="idp610656">       As a more substantial application of XML processing, the GML-to-SVG (GML2SVG) application
     1446<p id="idp627360">       As a more substantial application of XML processing, the GML-to-SVG (GML2SVG) application
    14471447was chosen.   This application transforms geospatially encoded data represented using
    14481448an XML representation in the form of Geography Markup Language (GML) \cite{lake2004geography}
     
    14561456a known XML format for the purpose of analysis and restructuring to meet
    14571457the requirements of an alternative format.</p>
    1458 <p id="idp612032">Our GML to SVG data translations are executed on GML source data
     1458<p id="idp628688">Our GML to SVG data translations are executed on GML source data
    14591459modelling the city of Vancouver, British Columbia, Canada.
    14601460The GML source document set
     
    14641464213.4 MB of source GML data generates 91.9 MB of target SVG data.</p>
    14651465<div class="figure" id="perf_GML2SVG">
    1466 <p class="title">Figure 6: Performance Comparison for GML2SVG</p>
     1466<p class="title">Figure 7: Performance Comparison for GML2SVG</p>
    14671467<div class="figure-contents">
    1468 <div class="mediaobject" id="idp614016"><img alt="png image (Throughput.png)" src="Throughput.png" width="500cm"></div>
     1468<div class="mediaobject" id="idp630672"><img alt="png image (Throughput.png)" src="Throughput.png" width="500cm"></div>
    14691469<div class="caption"></div>
    14701470</div>
    14711471</div>
    1472 <p id="idp616352">Figure \ref{perf_GML2SVG} compares the performance of the GML2SVG application linked against
     1472<p id="idp632960"><a class="xref" href="#perf_GML2SVG" title="Performance Comparison for GML2SVG">Figure 7</a> compares the performance of the GML2SVG application linked against
    14731473the Xerces, icXML and icXML-p.   
    14741474On the GML workload with this application, single-thread icXML
     
    14771477Using icXML-p, a further throughput increase to 111 MB/sec was recorded,
    14781478approximately a 2X speedup.</p>
    1479 <p id="idp617168">An important aspect of icXML is the replacement of much branch-laden
     1479<p id="idp634368">An important aspect of icXML is the replacement of much branch-laden
    14801480sequential code inside Xerces with straight-line SIMD code using far
    1481 fewer branches.  Figure \ref{branchmiss_GML2SVG} shows the corresponding
     1481fewer branches.  <a class="xref" href="#branchmiss_GML2SVG" title="Comparative Branch Misprediction Rate">Figure 8</a> shows the corresponding
    14821482improvement in branching behaviour, with a dramatic reduction in branch misses per kB.
    14831483It is also interesting to note that icXML-p goes even further.   
     
    14861486less overloaded and able to increase the successful branch prediction rate.</p>
    14871487<div class="figure" id="branchmiss_GML2SVG">
    1488 <p class="title">Figure 7: Comparative Branch Misprediction Rate</p>
     1488<p class="title">Figure 8: Comparative Branch Misprediction Rate</p>
    14891489<div class="figure-contents">
    1490 <div class="mediaobject" id="idp619232"><img alt="png image (BM.png)" src="BM.png" width="500cm"></div>
     1490<div class="mediaobject" id="idp637104"><img alt="png image (BM.png)" src="BM.png" width="500cm"></div>
    14911491<div class="caption"></div>
    14921492</div>
    14931493</div>
    1494 <p id="idp621520">The behaviour of the three versions with respect to L1 cache misses per kB is shown
    1495 in Figure \ref{cachemiss_GML2SVG}.   Improvements are shown in both instruction-
     1494<p id="idp639392">The behaviour of the three versions with respect to L1 cache misses per kB is shown
     1495in <a class="xref" href="#cachemiss_GML2SVG" title="Comparative Cache Miss Rate">Figure 9</a>.   Improvements are shown in both instruction-
    14961496and data-cache performance with the improvements in instruction-cache
    14971497behaviour the most dramatic.   Single-threaded icXML shows substantially improved
     
    15021502significant benefit.</p>
    15031503<div class="figure" id="cachemiss_GML2SVG">
    1504 <p class="title">Figure 8: Comparative Cache Miss Rate</p>
     1504<p class="title">Figure 9: Comparative Cache Miss Rate</p>
    15051505<div class="figure-contents">
    1506 <div class="mediaobject" id="idp623696"><img alt="png image (CM.png)" src="CM.png" width="500cm"></div>
     1506<div class="mediaobject" id="idp642192"><img alt="png image (CM.png)" src="CM.png" width="500cm"></div>
    15071507<div class="caption"></div>
    15081508</div>
    15091509</div>
    1510 <p id="idp625984">One caveat with this study is that the GML2SVG application did not exhibit
     1510<p id="idp644480">One caveat with this study is that the GML2SVG application did not exhibit
    15111511a relative balance of processing between application code and Xerces library
    15121512code reaching the 33% figure.  This suggests that for this application and
     
    15161516</div>
    15171517</div>
    1518 <div class="section" id="idp627056">
     1518<div class="section" id="conclusion">
    15191519<h2 class="title" style="clear: both">Conclusion and Future Work</h2>
    1520 <p id="idp627744"> This paper is the first case study documenting the significant performance benefits
     1520<p id="idp646640"> This paper is the first case study documenting the significant performance benefits
    15211521         that may be realized through the integration of parallel bitstream technology into existing
    15221522         widely-used software libraries. In the case of the Xerces-C++ XML parser, the combined
     
    15281528         technologies, this is an important case study demonstrating the general feasibility of
    15291529         these techniques. </p>
    1530 <p id="idp629024"> The further development of icXML to move beyond 2-stage pipeline parallelism is
     1530<p id="idp647920"> The further development of icXML to move beyond 2-stage pipeline parallelism is
    15311531         ongoing, with realistic prospects for four reasonably balanced stages within the library.
    15321532         For applications such as GML2SVG which are dominated by time spent on XML parsing, such a
    15331533         multistage pipelined parsing library should offer substantial benefits. </p>
    1534 <p id="idp629792"> The example of XML parsing may be considered prototypical of finite-state machines
     1534<p id="idp648688"> The example of XML parsing may be considered prototypical of finite-state machines
    15351535         applications which have sometimes been considered "embarassingly
    15361536         sequential" and so difficult to parallelize that "nothing
     
    15381538         point in making the case that parallelization can indeed be helpful across a broad array of
    15391539         application types. </p>
    1540 <p id="idp631168"> To overcome the software engineering challenges in applying parallel bitstream
     1540<p id="idp650064"> To overcome the software engineering challenges in applying parallel bitstream
    15411541         technology to existing software systems, it is clear that better library and tool support
    15421542         is needed. The techniques used in the implementation of icXML and documented in this paper
     
    15451545      </p>
    15461546</div>
    1547 <div class="bibliography" id="idp632656">
     1547<div class="bibliography" id="idp652000">
    15481548<h2 class="title" style="clear:both">Bibliography</h2>
    15491549<p class="bibliomixed" id="XMLChip09">[Leventhal and Lemoine 2009] Leventhal, Michael and
  • docs/Balisage13/Bal2013came0601/Bal2013came0601.xml

    r3056 r3057  
    189189      <para>
    190190        The remainder of this paper is organized as follows.   
    191         Section \ref{background} discusses the structure of the Xerces and Parabix XML parsers and the fundamental
     191          <xref linkend="background"/> discusses the structure of the Xerces and Parabix XML parsers and the fundamental
    192192        differences between the two parsing models.   
    193         Section \ref{architecture} then presents the icXML design based on a restructured Xerces architecture to
     193        <xref linkend="architecture"/> then presents the icXML design based on a restructured Xerces architecture to
    194194        incorporate SIMD parallelism using Parabix methods.   
    195         Section \ref{multithread} moves on to consider the multithreading of the icXML architecture
     195        <xref linkend="multithread"/> moves on to consider the multithreading of the icXML architecture
    196196        using the pipeline parallelism model. 
    197         Section \ref{performance} analyzes the performance of both the single-threaded and
     197        <xref linkend="performance"/> analyzes the performance of both the single-threaded and
    198198        multi-threaded versions of icXML in comparison to original Xerces,
    199199        demonstrating substantial end-to-end acceleration of
    200200        a GML-to-SVG translation application written against the Xerces API.
    201         Section \ref{conclusion} concludes the paper with a discussion of future work and the potential for
     201          <xref linkend="conclusion"/> concludes the paper with a discussion of future work and the potential for
    202202        applying the techniques discussed herein in other application domains.
    203203      </para>
    204204   </section>
    205205
    206    <section>
     206   <section xml:id="background">
    207207      <title>Background</title>
    208       <section>
     208      <section xml:id="background-xerces">
    209209         <title>Xerces C++ Structure</title>
    210210         <para> The Xerces C++ parser is a widely-used standards-conformant
     
    414414      </section>
    415415   </section>
    416    <section>
     416   <section xml:id="architecture">
    417417      <title>Architecture</title>
    418418      <section>
     
    422422            restructured and rearchitected with pipeline parallelism in mind. In this section, we
    423423            highlight the core differences between the two systems. As shown in Figure
    424             \ref{fig:xerces-arch}, Xerces is comprised of five main modules: the transcoder, reader,
     424              <xref linkend="xerces-arch"/>, Xerces is comprised of five main modules: the transcoder, reader,
    425425            scanner, namespace binder, and validator. The <emphasis role="ital"
    426426            >Transcoder</emphasis> converts source data into UTF-16 before Xerces parses it as XML;
     
    453453          </caption>
    454454        </figure>
    455          <para> In icXML functions are grouped into logical components. As shown in Figure
     455         <para> In icXML functions are grouped into logical components. As shown in
    456456             <xref linkend="xerces-arch"/>, two major categories exist: (1) the Parabix Subsystem and (2) the
    457457            Markup Processor. All tasks in (1) use the Parabix Framework \cite{HPCA2012}, which
    458458            represents data as a set of parallel bitstreams. The <emphasis role="ital">Character Set
    459                Adapter</emphasis>, discussed in Section \ref{arch:character-set-adapter}, mirrors
     459              Adapter</emphasis>, discussed in <xref linkend="character-set-adapter"/>, mirrors
    460460            Xerces's Transcoder duties; however instead of producing UTF-16 it produces a set of
    461             lexical bitstreams, similar to those shown in Figure \ref{fig:parabix1}. These lexical
     461              lexical bitstreams, similar to those shown in <xref linkend="parabix1"/>. These lexical
    462462            bitstreams are later transformed into UTF-16 in the Content Stream Generator, after
    463463            additional processing is performed. The first precursor to producing UTF-16 is the
     
    471471               >Line-Column Tracker</emphasis> uses the lexical information to keep track of the
    472472            document position(s) through the use of an optimized population count algorithm,
    473             described in Section \ref{section:arch:errorhandling}. From here, two data-independent
     473              described in <xref linkend="errorhandling"/>. From here, two data-independent
    474474            branches exist: the Symbol Resolver and Content Preparation Unit. </para>
    475475         <para> A typical XML file contains few unique element and attribute names&#8212;but
     
    483483            Generator</emphasis>. The former takes the (transposed) basis bitstreams and selectively
    484484            filters them, according to the information provided by the Parallel Markup Parser, and
    485             the latter transforms the filtered streams into the tagged UTF-16 <emphasis role="ital"
    486                >content stream</emphasis>, discussed in Section \ref{section:arch:contentstream}. </para>
     485            the latter transforms the filtered streams into the tagged UTF-16 <emphasis role="ital">content stream</emphasis>, discussed in <xref linkend="contentstream"/>. </para>
    487486         <para> Combined, the symbol and content stream form icXML's compressed IR of the XML
    488487            document. The <emphasis role="ital">Markup Processor</emphasis>~parses the IR to
     
    494493            that produces a series of URI identifiers (URI IDs), the <emphasis role="ital">URI
    495494               stream</emphasis>, which are associated with each symbol occurrence. This is
    496             discussed in Section \ref{section:arch:namespacehandling}. Finally, the <emphasis
     495                 discussed in <xref linkend="namespace-handling"/>. Finally, the <emphasis
    497496               role="ital">Validation</emphasis> layer implements the Xerces's validator. However,
    498497            preprocessing associated with each symbol greatly reduces the work of this stage. </para>
     
    508507        </figure>
    509508      </section>
    510       <section>
     509      <section xml:id="character-set-adapter">
    511510         <title>Character Set Adapters</title>
    512511         <para> In Xerces, all input is transcoded into UTF-16 to simplify the parsing costs of
     
    553552            using bit scan operations. </para>
    554553      </section>
    555       <section>
     554      <section xml:id="par-filter">
    556555         <title>Combined Parallel Filtering</title>
    557556         <para> As just mentioned, UTF-8 to UTF-16 transcoding involves marking all but the last
     
    615614         <para> The final step of combined filtering occurs during the process of reducing markup
    616615            data to tag bytes preceding each significant XML transition as described in
    617             section~\ref{section:arch:contentstream}. Overall, icXML avoids separate buffer copying
     616              <xref linkend="contentstream"/>. Overall, icXML avoids separate buffer copying
    618617            operations for each of the these filtering steps, paying the cost of parallel deletion
    619618            and inverse transposition only once. Currently, icXML employs the parallel-prefix
     
    623622            Haswell architecture. </para>
    624623      </section>
    625       <section>
     624      <section xml:id="contentstream">
    626625         <title>Content Stream</title>
    627626         <para> A relatively-unique concept for icXML is the use of a filtered content stream.
     
    632631            is transformed into <!-- CODE -->
    633632            <!--``<emphasis role="ital">0</emphasis>\verb`>fee`<emphasis role="ital">0</emphasis>\verb`=fie`<emphasis role="ital">0</emphasis>\verb`=foe`<emphasis role="ital">0</emphasis>\verb`>`<emphasis role="ital">0</emphasis>\verb`/fum`<emphasis role="ital">0</emphasis>\verb`/`''-->
    634             through the parallel filtering algorithm, described in section \ref{sec:parfilter}. </para>
     633            through the parallel filtering algorithm, described in <xref linkend="par-filter"/>. </para>
    635634         <para> Combined with the symbol stream, the parser traverses the content stream to
    636635            effectively reconstructs the input document in its output form. The initial <emphasis
     
    641640            in accordance with the Xerces API specification. Unlike Xerces, no memory-copy
    642641            operations are required to produce these strings, which as
    643             Figure~\ref{fig:xerces-profile} shows accounts for 6.83% of Xerces's execution time.
     642              <xref linkend="xerces-profile"/> shows accounts for 6.83% of Xerces's execution time.
    644643            Additionally, it is cheap to locate the terminal character of each string: using the
    645644            String End bitstream, the Parabix Subsystem can effectively calculate the offset of each
     
    660659            that the appropriate scope-nesting rules have been applied. </para>
    661660      </section>
    662       <section>
     661      <section xml:id="namespace-handling">
    663662         <title>Namespace Handling</title>
    664663         <!-- Should we mention canonical bindings or speculation? it seems like more of an optimization than anything. -->
     
    667666            such as when XML or SVG documents are embedded within XHTML files. Namespaces are bound
    668667            to uniform resource identifiers (URIs), which are strings used to identify specific
    669             names or resources. On line 1 in the Table below, the <code>xmlns</code>
     668            names or resources. On line 1 in <xref linkend="namespace-ex"/>, the <code>xmlns</code>
    670669            attribute instructs the XML processor to bind the prefix <code>p</code> to the URI
    671670               &apos;<code>pub.net</code>&apos; and the default (empty) prefix to
     
    680679            uniquely-named items because the current vocabulary is determined by the namespace(s)
    681680            that are in-scope. </para>
    682 <table>
     681<table xml:id="namespace-ex">
    683682                  <caption>
    684683                     <para>XML Namespace Example</para>
     
    713712            process) to the URI. Each unique namespace binding has a unique namespace id (NSID) and
    714713            every prefix contains a bit vector marking every NSID that has ever been associated with
    715             it within the document. For example, in Table \ref{tbl:namespace1}, the prefix binding
     714              it within the document. For example, in <xref linkend="namespace-ex"/>, the prefix binding
    716715            set of <code>p</code> and <code>xmlns</code> would be <code>01</code> and
    717716            <code>11</code> respectively. To resolve the in-scope namespace binding for each prefix,
    718717            a bit vector of the currently visible namespaces is maintained by the system. By ANDing
    719718            the prefix bit vector with the currently visible namespaces, the in-scope NSID can be
    720             found using a bit-scan intrinsic. A namespace binding table, similar to Table
    721             \ref{tbl:namespace1}, provides the actual URI ID. </para>
    722 <table>
     719            found using a bit-scan intrinsic. A namespace binding table, similar to
     720            <xref linkend="namespace-binding"/>, provides the actual URI ID. </para>
     721<table xml:id="namespace-binding">
    723722                  <caption>
    724723                     <para>Namespace Binding Table Example</para>
     
    756755         </para>
    757756      </section>
    758       <section>
     757      <section xml:id="errorhandling">
    759758         <title>Error Handling</title>
    760759         <para>
     
    765764            type and severity of the error. icXML emits errors in the similar manner&#8212;but
    766765            how it discovers them is substantially different. Recall that in Figure
    767             \ref{fig:icxml-arch}, icXML is divided into two sections: the Parabix Subsystem and
     766            <xref linkend="icxml-arch"/>, icXML is divided into two sections: the Parabix Subsystem and
    768767            Markup Processor, each with its own system for detecting and producing error messages. </para>
    769768         <para> Within the Parabix Subsystem, all computations are performed in parallel, a block at
     
    816815   </section>
    817816
    818    <section>
     817   <section xml:id="multithread">
    819818      <title>Multithreading with Pipeline Parallelism</title>
    820       <para> As discussed in section \ref{background:xerces}, Xerces can be considered a FSM
     819      <para> As discussed in section <xref linkend="background-xerces"/>, Xerces can be considered a FSM
    821820         application. These are &quot;embarrassingly
    822821         sequential.&quot;\cite{Asanovic:EECS-2006-183} and notoriously difficult to
     
    836835         and grammar-based validation, and the provides parsed XML data to the application through
    837836         the Xerces API. The shared data structure is implemented using a ring buffer, where every
    838          entry contains an independent set of data streams. In the examples of Figure
    839          \ref{threads_timeline1} and \ref{threads_timeline2}, the ring buffer has four entries. A
     837         entry contains an independent set of data streams. In the examples of
     838           <xref linkend="threads_timeline1"/>, the ring buffer has four entries. A
    840839         lock-free mechanism is applied to ensure that each entry can only be read or written by one
    841          thread at the same time. In Figure \ref{threads_timeline1} the processing time of
     840         thread at the same time. In  <xref linkend="threads_timeline1"/> the processing time of
    842841               <code>T<subscript>1</subscript></code> is longer than
    843842         <code>T<subscript>2</subscript></code>; thus <code>T<subscript>2</subscript></code> always
    844          waits for <code>T<subscript>1</subscript></code> to write to the shared memory. Figure
    845          \ref{threads_timeline2} illustrates the scenario in which
     843         waits for <code>T<subscript>1</subscript></code> to write to the shared memory.  
     844         <xref linkend="threads_timeline2"/> illustrates the scenario in which
    846845         <code>T<subscript>1</subscript></code> is faster and must wait for
    847846            <code>T<subscript>2</subscript></code> to finish reading the shared data before it can
     
    849848      <para>
    850849        <figure xml:id="threads_timeline1">
    851           <title>Thread Balance in Two-Stage Pipelines</title>
     850          <title>Thread Balance in Two-Stage Pipelines: Stage 1 Dominant</title>
    852851          <mediaobject>
    853852            <imageobject>
     
    855854            </imageobject>
    856855          </mediaobject>
    857           <mediaobject>
     856         </figure>
     857        <figure xml:id="threads_timeline2">
     858          <title>Thread Balance in Two-Stage Pipelines: Stage 2 Dominant</title>
     859        <mediaobject>
    858860            <imageobject>
    859861              <imagedata format="png" fileref="threads_timeline2.png" width="500cm"/>
    860862            </imageobject>
    861863          </mediaobject>
    862           <caption>
    863           </caption>
    864864        </figure>
    865865      </para>
     
    893893   </section>
    894894
    895    <section>
     895   <section xml:id="performance">
    896896      <title>Performance</title>
    897897      <para> We evaluate Xerces-C++ 3.1.1, icXML, icXML-p against two benchmarking applications: the
     
    916916            totals. </para>
    917917
    918          <para> Table \ref{XMLDocChars} shows the document characteristics of the XML input files
     918 <para> <xref linkend="XMLdocs"/> shows the document characteristics of the XML input files
    919919            selected for the Xerces C++ SAXCount benchmark. The jaw.xml represents document-oriented
    920920            XML inputs and contains the three-byte and four-byte UTF-8 sequence required for the
    921921            UTF-8 encoding of Japanese characters. The remaining data files are data-oriented XML
    922922            documents and consist entirely of single byte encoded ASCII characters.
    923   <table>
     923  <table xml:id="XMLdocs">
    924924                  <caption>
    925925                     <para>XML Document Characteristics</para>
     
    948948            of document-oriented and data-oriented XML files to analyze performance over a spectrum
    949949            of markup densities. </para>
    950          <para> Figure \ref{perf_SAX} compares the performance of Xerces, icXML and pipelined icXML
     950         <para> <xref linkend="perf_SAX"/> compares the performance of Xerces, icXML and pipelined icXML
    951951            in terms of CPU cycles per byte for the SAXCount application. The speedup for icXML over
    952952            Xerces is 1.3x to 1.8x. With two threads on the multicore machine, icXML-p can achieve
     
    10031003        </figure>
    10041004       
    1005 <para>Figure \ref{perf_GML2SVG} compares the performance of the GML2SVG application linked against
     1005<para><xref linkend="perf_GML2SVG"/> compares the performance of the GML2SVG application linked against
    10061006the Xerces, icXML and icXML-p.   
    10071007On the GML workload with this application, single-thread icXML
     
    10131013<para>An important aspect of icXML is the replacement of much branch-laden
    10141014sequential code inside Xerces with straight-line SIMD code using far
    1015 fewer branches.  Figure \ref{branchmiss_GML2SVG} shows the corresponding
     1015fewer branches.  <xref linkend="branchmiss_GML2SVG"/> shows the corresponding
    10161016improvement in branching behaviour, with a dramatic reduction in branch misses per kB.
    10171017It is also interesting to note that icXML-p goes even further.   
     
    10321032
    10331033<para>The behaviour of the three versions with respect to L1 cache misses per kB is shown
    1034 in Figure \ref{cachemiss_GML2SVG}.   Improvements are shown in both instruction-
     1034in <xref linkend="cachemiss_GML2SVG"/>.   Improvements are shown in both instruction-
    10351035and data-cache performance with the improvements in instruction-cache
    10361036behaviour the most dramatic.   Single-threaded icXML shows substantially improved
     
    10611061   </section>
    10621062
    1063    <section>
     1063   <section xml:id="conclusion">
    10641064      <title>Conclusion and Future Work</title>
    10651065      <para> This paper is the first case study documenting the significant performance benefits
Note: See TracChangeset for help on using the changeset viewer.