Changeset 2872 for docs/Working/icXML


Timestamp: Jan 30, 2013, 6:03:41 PM
Author: nmedfort
Message: edits
Location: docs/Working/icXML
Files: 14 edited

  • docs/Working/icXML/abstract.tex

r2869 r2872

  Prior research on the acceleration of XML processing
- using SIMD and multicore parallelism has lead to
+ using SIMD and multi-core parallelism has lead to
  a number of interesting research prototypes.  This work
  investigates the extent to which the techniques underlying
  • docs/Working/icXML/arch-charactersetadapters.tex

r2866 r2872

  In the important case of UTF-8 to UTF-16 transcoding, the transcoding costs can be significant,
  because of the need to decode and classify each byte of input, mapping variable-length UTF-8
- byte sequences into 16-bit UTF-16 code units with bit manipulation operations.   In other
- cases, transcoding may involve table lookup operations for each byte of input.  In any case,
+ byte sequences into 16-bit UTF-16 code units with bit manipulation operations.
+ In other cases, transcoding may involve table look-up operations for each byte of input.  In any case,
  transcoding imposes at least a cost of buffer copying.
     
  Given a specified input encoding, a CSA is responsible for checking that
  input code units represent valid characters, mapping the characters of the encoding into
- the appropriate bit streams for XML parsing actions (i.e., producing the lexical item
+ the appropriate \bitstream{}s for XML parsing actions (i.e., producing the lexical item
  streams), as well as supporting ultimate transcoding requirements.   All of this work
- is performed using the parallel bit stream representation of the source input.
+ is performed using the parallel \bitstream{} representation of the source input.

  An important observation is that many character sets are an
     
  A second observation is that---regardless of which character set is used---quite
  often all of the characters in a particular block of input will be within the ASCII range.
- This is a very simple test to perform using the bit stream representation, simply confirming that the
+ This is a very simple test to perform using the \bitstream{} representation, simply confirming that the
  bit 0 stream is zero for the entire block.   For blocks satisfying this test,
  all logic dealing with non-ASCII characters can simply be skipped.
- Transcoding to UTF-16 becomes trivial as the high eight bit streams of the
+ Transcoding to UTF-16 becomes trivial as the high eight \bitstream{}s of the
  UTF-16 form are each set to zero in this case.
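The all-ASCII block test discussed in this hunk is simple to model. In the Parabix convention, "bit 0" is the most significant bit of each byte, so the test reduces to confirming that no byte in the block has its high bit set. A plain-Python sketch (the real implementation performs this on transposed SIMD registers):

```python
def block_is_ascii(block: bytes) -> bool:
    """Check whether every byte in the block is in the ASCII range.

    Equivalent to confirming the bit 0 stream (each byte's high bit)
    is zero for the entire block: OR all bytes together and inspect
    the high bit of the accumulated value.
    """
    acc = 0
    for b in block:
        acc |= b
    return (acc & 0x80) == 0
```

For a block passing this test, UTF-16 transcoding is trivial: the high eight UTF-16 bitstreams are all zero.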
  A third observation is that repeated transcoding of the names of XML
- elements, attributes and so on can be avoided by using a lookup mechanism.
- That is, the first occurrence of each symbol is stored in a lookup
+ elements, attributes and so on can be avoided by using a look-up mechanism.
+ That is, the first occurrence of each symbol is stored in a look-up
  table mapping the input encoding to a numeric symbol ID.   Transcoding
- of the symbol is applied at this time.  Subsequent lookup operations
+ of the symbol is applied at this time.  Subsequent look-up operations
  can avoid transcoding by simply retrieving the stored representation.
- As symbol lookup is required to apply various XML validation rules,
+ As symbol look up is required to apply various XML validation rules,
  there is achieves the effect of transcoding each occurrence without
  additional cost.
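The look-up mechanism in this hunk amounts to a memoizing symbol table: transcode on first sight, return the stored ID and cached form thereafter. A minimal sketch — the class and its `transcode` stand-in (plain UTF-8 decoding) are illustrative, not the icXML implementation:

```python
class SymbolTable:
    """Map each distinct symbol (e.g. an element or attribute name) in its
    source encoding to a numeric symbol ID, transcoding only on first sight.
    Subsequent look-ups retrieve the stored representation, so repeated
    occurrences incur no further transcoding cost."""

    def __init__(self):
        self._ids = {}          # raw encoded symbol -> numeric symbol ID
        self._transcoded = []   # symbol ID -> transcoded form
        self.transcode_calls = 0

    def lookup(self, raw: bytes) -> int:
        sid = self._ids.get(raw)
        if sid is None:                      # first occurrence only
            sid = len(self._transcoded)
            self._ids[raw] = sid
            self._transcoded.append(self._transcode(raw))
        return sid

    def _transcode(self, raw: bytes) -> str:
        # Stand-in for the CSA transcoder; counts calls to show caching.
        self.transcode_calls += 1
        return raw.decode("utf-8")

    def resolve(self, sid: int) -> str:
        return self._transcoded[sid]
```

Since symbol look-up is already needed for XML validation rules, the caching delivers per-occurrence transcoding effectively for free.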
     
  The cost of individual character transcoding is avoided whenever a block of input is
  confined to the ASCII subset and for all but the first occurrence of any XML element or attribute name.
- Furthermore, when transcoding is required, the parallel bit stream representation
- supports efficient transcoding operations.   In the important
- case of UTF-8 to UTF-16 transcoding, the corresponding UTF-16 bit streams
+ Furthermore, when transcoding is required, the parallel \bitstream{} representation
+ supports efficient transcoding operations.
+ In the important case of UTF-8 to UTF-16 transcoding, the corresponding UTF-16 \bitstream{}s
  can be calculated in bit parallel fashion based on UTF-8 streams \cite{Cameron2008},
- and all but the final bytes of multibyte sequences can be marked for deletion as
+ and all but the final bytes of multi-byte sequences can be marked for deletion as
  discussed in the following subsection.
  In other cases, transcoding within a block only need be applied for non-ASCII
  • docs/Working/icXML/arch-contentstream.tex

r2531 r2872

  accounts for $6.83\%$ of Xerces's execution time.
  Additionally, it is cheap to locate the terminal character of each string:
- using the String End bit stream, the \PS{} can effectively calculate the offset of each
+ using the String End \bitstream{}, the \PS{} can effectively calculate the offset of each
  null character in the content stream in parallel, which in turn means the parser can
  directly jump to the end of every string without scanning for it.
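The String End bitstream trick in this hunk can be modeled with an arbitrary-width integer: one set bit per null terminator, with each offset recovered by isolating the lowest set bit rather than scanning character by character. A sketch with an illustrative content stream:

```python
def string_end_offsets(mask: int) -> list:
    """Given a String End bitstream (bit i set where content[i] is a null
    terminator), return every terminator's offset by repeatedly isolating
    the lowest set bit -- no per-character scanning."""
    offsets = []
    while mask:
        low = mask & -mask                 # isolate lowest 1-bit
        offsets.append(low.bit_length() - 1)
        mask ^= low                        # clear it and continue
    return offsets

# Hypothetical content stream: three strings separated by NUL terminators.
content = b"abc\x00de\x00fgh\x00"
mask = 0
for i, ch in enumerate(content):
    if ch == 0:
        mask |= 1 << i
```

With the mask in hand, the parser can jump directly to the end of each string.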
  • docs/Working/icXML/arch-errorhandling.tex

r2866 r2872

  Within the \PS{}, all computations are performed in parallel, a block at a time.
- Errors are derived as artifacts of bit stream calculations, with a 1-bit marking the byte-position of an error within a block,
+ Errors are derived as artifacts of \bitstream{} calculations, with a 1-bit marking the byte-position of an error within a block,
  and the type of error is determined by the equation that discovered it.
  The difficulty of error processing in this section is that in Xerces the line and column number must be given
     
  During this process, white-space normalization rules are applied and multi-code-unit and surrogate characters are detected
  and validated.
- A {\it line-feed bit stream}, which marks the positions of the normalized new lines characters, is a natural derivative of
+ A {\it line-feed \bitstream{}}, which marks the positions of the normalized new lines characters, is a natural derivative of
  this process.
  Using an optimized population count algorithm, the line count can be summarized cheaply for each valid block of text.
  % The optimization delays the counting process ....
  Column position is more difficult to calculate.
- It is possible to scan backwards through the bit stream of new line characters to determine the distance (in code-units)
+ It is possible to scan backwards through the \bitstream{} of new line characters to determine the distance (in code-units)
  between the position between which an error was detected and the last line feed. However, this distance may exceed
- than the acutal character position for the reasons discussed in (2).
- To handle this, the CSA generates a {\it skip mask} bit stream by ORing together many relevant bit streams,
+ than the actual character position for the reasons discussed in (2).
+ To handle this, the CSA generates a {\it skip mask} \bitstream{} by ORing together many relevant \bitstream{}s,
  such as all trailing multi-code-unit and surrogate characters, and any characters that were removed during the
  normalization process.
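The line/column scheme in this hunk — population count over the line-feed bitstream for the line, distance back to the last line feed corrected by the skip mask for the column — can be sketched over Python integers. This is a model of the approach described above, not the icXML implementation:

```python
def line_and_column(err_pos: int, lf_stream: int, skip_mask: int):
    """Estimate the 1-based line and column of an error position.

    lf_stream: bitstream with a 1 at each normalized line-feed position.
    skip_mask: bitstream of positions that are not logical characters
               (e.g. trailing multi-code-unit bytes, removed characters).
    """
    below = (1 << err_pos) - 1                       # positions before err_pos
    preceding_lfs = lf_stream & below
    line = bin(preceding_lfs).count("1") + 1         # popcount of prior LFs
    last_lf = preceding_lfs.bit_length() - 1 if preceding_lfs else -1
    span = below & ~((1 << (last_lf + 1)) - 1)       # positions after last LF
    # Raw code-unit distance, minus skipped positions, gives the column.
    column = (err_pos - last_lf) - bin(skip_mask & span).count("1")
    return line, column
```

For example, with text `"ab\ncde"` an error at position 4 resolves to line 2; a skip-mask bit inside the span reduces the column by one, exactly the correction the skip mask exists to make.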
     
  column number.

- % \begin{figure}[ht]
- % {\bf TODO: An example of a skip mask, error mask, and the raw data and transcoded data for it.
- % Should a multi-byte character be used and/or some CRLFs to show the difficulties?}
- % \label{fig:error_mask}
- % \caption{}
- % \end{figure}

  The \MP{} is a state-driven machine. As such, error detection within it is very similar to Xerces.
     
  thus its impossible to derive the current location using only the content stream.
  To calculate the location, the \MP{} borrows three additional pieces of information from the \PS{}:
- the line-feed, skip mask, and a {\it deletion mask stream}, which is a bit stream denoting the (code-unit) position of every
- datum that was surpressed from the source during the production of the content stream.
+ the line-feed, skip mask, and a {\it deletion mask stream}, which is a \bitstream{} denoting the (code-unit) position of every
+ datum that was suppressed from the source during the production of the content stream.
  Armed with these, it is possible to calculate the actual line/column using
  the same system as the \PS{} until the sum of the negated deletion mask stream is equal to the current position.
  • docs/Working/icXML/arch-namespace.tex

r2522 r2872

  To resolve the in-scope namespace binding for each prefix, a bit vector of the currently visible namespaces is
  maintained by the system. By ANDing the prefix bit vector with the currently visible namespaces, the in-scope
- NSID can be found using a bit scan instruction.
+ NSID can be found using a bit-scan intrinsic.
  A namespace binding table, similar to Table \ref{tbl:namespace1}, provides the actual URI ID.
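The AND-then-bit-scan resolution in this hunk is a two-instruction operation on real hardware. A Python model — note that which surviving bit to take (lowest here) depends on how binding order is encoded in the vectors, an assumption of this sketch:

```python
def in_scope_nsid(prefix_vector: int, visible: int) -> int:
    """Resolve a prefix's in-scope namespace binding: AND the bit vector of
    the prefix's bindings with the bit vector of currently visible
    namespaces, then bit-scan the result.  Returns -1 if unbound."""
    match = prefix_vector & visible
    if match == 0:
        return -1                              # prefix not bound in scope
    return (match & -match).bit_length() - 1   # index of lowest set bit
```

The returned NSID then indexes the namespace binding table to obtain the actual URI ID.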
  • docs/Working/icXML/arch-overview.tex

r2871 r2872

  In \icXML{} functions are grouped into logical components.
  As shown in Figure \ref{fig:icxml-arch}, two major categories exist: (1) the \PS{} and (2) the \MP{}.
- All tasks in (1) use the Parabix Framework \cite{HPCA2012}, which represents data as a set of parallel bit streams.
+ All tasks in (1) use the Parabix Framework \cite{HPCA2012}, which represents data as a set of parallel \bitstream{}s.
  The {\it Character Set Adapter}, discussed in Section \ref{arch:character-set-adapter},
  mirrors Xerces's Transcoder duties; however instead of producing UTF-16 it produces a
- set of lexical bit streams, similar to those shown in Figure \ref{fig:parabix1}.
- These lexical bit streams are later transformed into UTF-16 in the \CSG{},
+ set of lexical \bitstream{}s, similar to those shown in Figure \ref{fig:parabix1}.
+ These lexical \bitstream{}s are later transformed into UTF-16 in the \CSG{},
  after additional processing is performed.
  The first precursor to producing UTF-16 is the {\it Parallel Markup Parser} phase.
- It takes the lexical streams and produces a set of marker bit streams in which a 1-bit identifies
- significant positions within the input data. One bit stream for each of the critical piece of information is created, such as
+ It takes the lexical streams and produces a set of marker \bitstream{}s in which a 1-bit identifies
+ significant positions within the input data. One \bitstream{} for each of the critical piece of information is created, such as
  the beginning and ending of start tags, end tags, element names, attribute names, attribute values and content.
  Intra-element well-formedness validation is performed as an artifact of this process.
     
  The {\it Line-Column Tracker} uses the lexical information to keep track of the document position(s) through the use of an
  optimized population count algorithm, described in Section \ref{section:arch:errorhandling}.
- From here, two data-independent branches exist: the Symbol Pesolver and Content Preperation Unit.
+ From here, two data-independent branches exist: the Symbol Resolver and Content Preparation Unit.

  A typical XML file contains few unique element and attribute names---but each of them will occur frequently.
     
  the raw data to produce a sequence of GIDs, called the {\it symbol stream}.

- The final components of the \PS{} are the {\it Content Preperation Unit} and {\it \CSG{}}.
- The former takes the (transposed) basis bit streams and selectively filters them, according to the
+ The final components of the \PS{} are the {\it Content Preparation Unit} and {\it \CSG{}}.
+ The former takes the (transposed) basis \bitstream{}s and selectively filters them, according to the
  information provided by the Parallel Markup Parser, and the latter transforms the
  filtered streams into the tagged UTF-16 {\it content stream}, discussed in Section \ref{section:arch:contentstream}.
     
  Combined, the symbol and content stream form \icXML{}'s compressed IR of the XML document.
  The {\it \MP{}}~parses the IR to validate and produce the sequential output for the end user.
- The {\it Final WF checker} performs inter-element wellformedness validation that would be too costly
- to perform in bitspace, such as ensuring every start tag has a matching end tag.
+ The {\it Final WF checker} performs inter-element well-formedness validation that would be too costly
+ to perform in bit space, such as ensuring every start tag has a matching end tag.
  Xerces's namespace binding functionality is replaced by the {\it Namespace Processor}. Unlike Xerces,
  it is a discrete phase that produces a series of URI identifiers (URI IDs), the {\it URI stream}, which are

  \label{fig:icxml-arch}
  \end{figure}
- 
- % Probably not the right area but should we discuss issues with Xerces design that we tried to correct?
- % - over-reliance on hash tables when domain knowledge dictated none would be needed
- % - constant buffering of text to ensure that every QName/NCName and content was contained within a single string
- % - abundant use of heap allocated memory
- % - text conversions done in multiple areas
- % - poor cache utilization; attempted to improve by using smaller layers of tasks in bulk
- 
- % As the previous section aluded, the greatest difference between sequential parsing methods
- % and the Parabix parsing model is how data is processed.
- % Consider Figure \ref{fig:parabix1} again. In it, the start tags are located independent of the end
- % tags. In order to produce Xerces-equivalent output, icXML must emit the start and end tag
- % events in sequential order, with all attribute data associated with the correct tag.
- %
- %
- 
- % The Parabix framework, however, does not allow for this (and would be hindered performance wise if
- % forced to.)
- % Thus our first question was, ``How can we how can we take full advantage
- % of Parabix whilst producing Xerces-equivalent output?'' Our answer came by analyzing what Xerces produced
- % when given an input text.
- %
- % By analyzing Xerces internal data structures and its produced output, two major observations were obvious:
- % (1) input data is transcoded into UTF-16 to ensure that there is a single standard character type, both
- % internally (within the grammar structures and hash tables) and externally (for the end user).
- % (2) all elements and attributes (both qualified and unqualified) are associated with a unique element
- % declaration or attribute definition within a specific grammar structure. Xerces emits the appropriate
- % grammar reference in place of the element or attribute string.
- 
- %   From Xerces to icXML
- %
- %   - Philosophy:  Maximizing Bit Stream Processing
- %
- %   - Character Set Adapters vs. Transcoding
- %   - Bitstreams 1: Charset Validation and Transcoding equations
- %   - Bitstreams 2: Parabix style parsing and validation
- %
- %   - Bitstreams 3: Parallel filtering and normalization
- %           - LB normalization
- %           - reference compression -> single code unit speculation
- %           - parallel string termination
- %
- %   - Bitstreams 4: Symbol processing
- %
- %   - From bit streams to doublebyte streams: the content buffer
- %
- %   - Namespace Processing: A Bitset approach.
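The inter-element check named in this file's hunk — every start tag has a matching end tag — is the classic stack discipline. A minimal model; the event-tuple input is illustrative, standing in for the \MP{}'s traversal of the symbol stream:

```python
def check_tag_matching(events) -> bool:
    """Minimal inter-element well-formedness check: every start tag must
    have a properly nested matching end tag.  `events` is a sequence of
    ('start', name) / ('end', name) pairs."""
    stack = []
    for kind, name in events:
        if kind == "start":
            stack.append(name)
        else:
            if not stack or stack.pop() != name:
                return False       # mismatched or stray end tag
    return not stack               # any leftover start tag is an error
```

This kind of unbounded-depth matching is cheap sequentially but awkward in bit space, which is why it is deferred to the Final WF checker.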
  • docs/Working/icXML/background-fundemental-differences.tex

r2866 r2872

  \subsection {Sequential vs. Parallel Paradigm}

- % Sequential: bytes through layers
  Xerces---like all traditional XML parsers---processes XML documents sequentially.
  Each character is examined to distinguish between the

  validation and content processing modes.

- 
  In other words, Xerces belongs to an equivalent class applications termed FSM applications\footnote{
-   Herein FSM applications are software systems whose behavior is defined by the inputs,
+   Herein FSM applications are considered software systems whose behaviour is defined by the inputs,
    current state and the events associated with transitions of states.}.
  Each state transition indicates the processing context of subsequent characters.
  Unfortunately, textual data tends to be unpredictable and any character could induce a state transition.
    1715
- % Unfortunately, textual data tends to consist of variable-length strings sequenced in
- % unpredictable patterns.
- % Each character must be examined in sequence because any character could be a state transition until deemed otherwise.
- 
- % Parallel: blocks/segments/buffers through layers
  Parabix-style XML parsers utilize a concept of layered processing.
- A block of source text is transformed into a set of lexical bit streams,
+ A block of source text is transformed into a set of lexical \bitstream{}s,
  which undergo a series of operations that can be grouped into logical layers,
  e.g., transposition, character classification, and lexical analysis.
  Each layer is pipeline parallel and require neither speculation nor pre-parsing stages\cite{HPCA2012}.
- % In adapting to the requirements of the Xerces sequential parsing API,
- % however, the resultant parallel bit streams may out-of-order \wrt{} the source document.
- % Hence they must be amalgamated and iterated through to produce sequential output.
  To meet the API requirements of the document-ordered Xerces output,
- the results of the Parabix processing layers must be interleaved to produce the equivalent behavior.
+ the results of the Parabix processing layers must be interleaved to produce the equivalent behaviour.
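The first layer named in this hunk, transposition, turns a block of bytes into eight basis bitstreams. A bit-at-a-time model (real Parabix implementations do this with SIMD pack operations; the high-bit-first numbering follows the paper's convention that bit 0 is the most significant bit):

```python
def transpose(block: bytes):
    """Transpose a block of bytes into eight basis bitstreams: basis
    stream k holds bit k of every byte, where bit 0 is each byte's most
    significant bit.  Position i of the block maps to bit i of each stream."""
    basis = [0] * 8
    for i, byte in enumerate(block):
        for k in range(8):
            if byte & (0x80 >> k):     # bit 0 = high bit of the byte
                basis[k] |= 1 << i
    return basis
```

Character classification and lexical analysis then operate on these streams with bitwise logic; for instance, a block is all-ASCII exactly when `basis[0] == 0`.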
  • docs/Working/icXML/background-parabix.tex

r2866 r2872

  lines show streams that can be computed in subsequent
  parsing (using the technique
- of bitstream addition \cite{cameron-EuroPar2011}), namely streams marking the element names,
+ of \bitstream{} addition \cite{cameron-EuroPar2011}), namely streams marking the element names,
  attribute names and attribute values of tags.
    6767
     
  sequential scanning loops for individual characters \cite{CameronHerdyLin2008}.
  Recent work has incorporated a method of parallel
- scanning using bitstream addition \cite{cameron-EuroPar2011}, as
+ scanning using \bitstream{} addition \cite{cameron-EuroPar2011}, as
  well as combining SIMD methods with 4-stage pipeline parallelism to further improve
  throughput \cite{HPCA2012}.
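Parallel scanning by bitstream addition, as referenced here, exploits carry propagation: adding a cursor bit to a run of 1s ripples a carry to the position just past the run, advancing every cursor in the block at once. A sketch of the scan-through operation (the sample text and streams are illustrative):

```python
def scan_thru(cursors: int, marker: int) -> int:
    """Advance each cursor bit through a contiguous run of 1s in `marker`,
    landing on the first position past the run.  The addition's carry
    propagation performs all scans simultaneously."""
    return (cursors + marker) & ~marker

# Illustrative use: advance past the element names in "<ab> <cd>".
text = b"<ab> <cd>"
name_chars = sum(1 << i for i, ch in enumerate(text) if chr(ch).isalpha())
cursors = (1 << 1) | (1 << 6)   # a cursor at the first character of each name
```

Each cursor lands on the `>` following its name, replacing a per-character scanning loop with one addition and two bitwise operations.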
     
  Commercial XML processors support transcoding of multiple character sets and can parse and
- validate against multiple document vocabulaties.
+ validate against multiple document vocabularies.
  Additionally, they provide API facilities beyond those found in research prototypes,
  including the widely used SAX, SAX2 and DOM interfaces.
  • docs/Working/icXML/background-xerces.tex

r2866 r2872

  Figure \ref{fig:xerces-profile} shows the execution time profile of the top ten functions in a typical run.
  Even if it were possible, Amdahl's Law dictates that tackling any one of these functions for
- parallelization in isolation would only produce a minute improvement in perfomance.
- Unfortunetly, early investigation into these functions found
+ parallelization in isolation would only produce a minute improvement in performance.
+ Unfortunately, early investigation into these functions found
  that incorporating speculation-free thread-level parallelization was impossible
  and they were already performing well in their given tasks;
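The Amdahl's Law argument in this hunk can be made concrete. The 10% fraction below is illustrative, not a figure from the Xerces profile:

```python
def amdahl_speedup(p: float, s: float) -> float:
    """Overall speedup when a fraction p of execution time is accelerated
    by a factor s (Amdahl's Law)."""
    return 1.0 / ((1.0 - p) + p / s)

# Even infinite acceleration of a single hot function accounting for,
# say, 10% of execution time caps the whole-program gain near 1.11x.
limit = amdahl_speedup(0.10, 1e12)
```

Hence parallelizing any one profiled function in isolation yields only a minute end-to-end improvement.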
     
  \label {fig:xerces-profile}
  \end{figure}
- 
- % Figure \ref{fig:xerces-arch} shows the
- % overall architecture of the Xerces C++ parser.
- % In analyzing the structure of Xerces, it was found that
- % there were a number of individual byte-at-a-time
- % processing tasks.
- %
- % \begin{enumerate}
- % \item Transcoding of source data to UTF-16
- % \item Character validation.
- % \item Line break normalization.
- % \item Character classification.
- % \item Line-column calculation.
- % \item Escape insertion and replacement.
- % \item Surrogate handling.
- % \item Name processing.
- % \item Markup parsing.
- % \item Attribute validation.
- % %\item Attribute checking.
- % %\item xmlns attribute processing.
- % \item Namespace processing.
- % \item Grammars, content model and data type validation.
- % \end{enumerate}
  • docs/Working/icXML/conclusion.tex

r2869 r2872

  This paper is the first case study documenting the significant
  performance benefits that may be realized through the integration
- of parallel bit stream technology into existing widely-used software libraries.
+ of parallel \bitstream{} technology into existing widely-used software libraries.
  In the case of the Xerces-C++ XML parser, the
  combined integration of SIMD and multicore parallelism was
     
  to provide the full functionality of the original Xerces library
  with complete compatibility of APIs.  Although substantial
- reengineering was required to realize the
+ re-engineering was required to realize the
  performance potential of parallel technologies, this
  is an important case study demonstrating the general
     
  To overcome the software engineering challenges in applying
- parallel bit stream technology to existing software systems,
+ parallel \bitstream{} technology to existing software systems,
  it is clear that better library and tool support is needed.
  The techniques used in the implementation of \icXML{} and

  applications in other contexts and automated through
  the creation of compiler technology specifically supporting
- parallel bit stream programming.
+ parallel \bitstream{} programming.
    3939
  • docs/Working/icXML/icxml-main.tex

r2871 r2872

  \def \MP {Markup Processor}
  \def \wrt {with respect to}
+ \def \bitstream{bitstream}

  \title{\icXML{}:  Accelerating a Commercial XML Parser Using SIMD and Multicore Technologies}
  • docs/Working/icXML/multithread.tex

r2871 r2872

- %\section{Leveraging SIMD Parallelism for Multicore: Pipeline Parallelism}
- 
- % As discussed in section \ref{background:xerces}, Xerces can be considered a complex finite-state machine
- % Finite-state machine belongs to the hardest application class to parallelize and process efficiently
- % among all presented in Berkeley study reports \cite{Asanovic:EECS-2006-183}.
- % However, \icXML{} reconstructs Xerces and provides logical layers between modules,
- % which naturally enables pipeline parallel processing.
- 
  As discussed in section \ref{background:xerces}, Xerces can be considered a FSM application.
- These are ``embarassingly sequential.''\cite{Asanovic:EECS-2006-183} and notoriously difficult to parallelize.
+ These are ``embarrassingly sequential.''\cite{Asanovic:EECS-2006-183} and notoriously difficult to parallelize.
  However, \icXML{} is designed to organize processing into logical layers.
  In particular, layers within the \PS{} are designed to operate
     
  The most straightforward division of work in \icXML{} is to separate
- the \PS{} and the \MP{} into distinct logical layers into two seperate stages.
+ the \PS{} and the \MP{} into distinct logical layers into two separate stages.
  The resultant application, {\it\icXMLp{}}, is a course-grained software-pipeline application.
  In this case, the \PS{} thread $T_1$ reads 16k of XML input $I$ at a time and produces the
     
  \end{figure}

- % In our pipeline model, each thread is in charge of one module or one group of modules.
- % A straight forward division is to take advantage of the layer between \PS{} and \MP{}.
- % In this case, the first thread $T_1$ will read 16k of XML input $I$ at a time
- % and process all the modules in \PS{} to generates
- % content buffer, symbol array, URI array, and store them to a pre-allocated shared data structure $S$.
- % The second thread $T_2$ consumes the data provided by the first thread and
- % goes through all the modules in Markup Processor and writes output $O$.
- 
- % The shared data structure is implemented using a ring buffer,
- % where each entry consists of all the arrays shared between the two threads with size of 160k.
- % In the example of Figure \ref{threads_timeline1} and \ref{threads_timeline2}, the ring buffer has four entries.
- % A lock-free mechanism is applied to ensure that each entry can only be read or written by one thread at the same time.
- % In Figure \ref{threads_timeline1}, the processing time of the first thread is longer,
- % thus the second thread always wait for the first thread to finish processing one chunk of input
- % and write to the shared memory.
- % Figure \ref{threads_timeline2} illustrates a different situation where the second thread is slower
- % and the first thread has to wait for the second thread finishing reading the shared data before it can reuse the memory space.
- 
  Overall, our design is intended to benefit a range of applications.
  Conceptually, we consider two design points.
     
  the \PS{} over the corresponding Xerces core, single-threaded
  \icXML{} delivers a 25\% speedup.   However, the most significant
- aspect of our two-stage multithreaded design then becomes the
+ aspect of our two-stage multi-threaded design then becomes the
  ability to hide the entire latency of parsing within the serial time
  required by the application.   In this case, we achieve
     
  the \PS{} is not worthwhile if the cost of application logic is little as
  33\% of the end-to-end cost using Xerces.  To achieve benefits of
- further parallelization with multicore technology, there would
+ further parallelization with multi-core technology, there would
  need to be reductions in the cost of application logic that
  could match reductions in core parsing cost.

- % \begin{figure}
- % \includegraphics[width=0.45\textwidth]{plots/threads_timeline1.pdf}
- % \caption{}
- % \label{threads_timeline1}
- % \end{figure}
- %
- % \begin{figure}
- % \includegraphics[width=0.45\textwidth]{plots/threads_timeline2.pdf}
- % \caption{}
- % \label{threads_timeline2}
- % \end{figure}
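The two-stage division described in this file — thread $T_1$ runs the \PS{} on chunks of input, thread $T_2$ consumes its output through a bounded shared buffer — can be modeled as a producer/consumer pipeline. A sketch: the bounded queue stands in for the lock-free ring buffer, and `parse`/`process` are placeholder callables, not icXML modules:

```python
import queue
import threading

def pipeline(chunks, parse, process, depth=4):
    """Two-stage software pipeline: a producer thread applies `parse` to
    each input chunk and a consumer thread applies `process` to the results
    in order.  The bounded queue provides backpressure, like a ring buffer:
    the producer blocks when all entries are full, the consumer when empty."""
    buf = queue.Queue(maxsize=depth)
    results = []

    def producer():
        for chunk in chunks:
            buf.put(parse(chunk))      # blocks while the buffer is full
        buf.put(None)                  # end-of-input sentinel

    def consumer():
        while True:
            item = buf.get()
            if item is None:
                break
            results.append(process(item))

    t1 = threading.Thread(target=producer)
    t2 = threading.Thread(target=consumer)
    t1.start(); t2.start()
    t1.join(); t2.join()
    return results
```

Because the single FIFO preserves chunk order, the consumer's output matches sequential processing, while the two stages overlap in time.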
  • docs/Working/icXML/parfilter.tex

r2866 r2872

  As just mentioned, UTF-8 to UTF-16 transcoding involves marking
- all but the last bytes of multibyte UTF-8 sequences as
+ all but the last bytes of multi-byte UTF-8 sequences as
  positions for deletion.   For example, the two
  Chinese characters \begin{CJK*}{UTF8}{gbsn}你好\end{CJK*}
     
  from six bit positions representing UTF-8 code units (bytes)
  down to just two bit positions representing UTF-16 code units
- (doublebytes).   This compression may be achieved by
+ (double bytes).   This compression may be achieved by
  arranging to calculate the correct UTF-16 bits at the
  final position of each sequence and creating a deletion
     
  \verb'110110'.  Using this approach, transcoding may then be
  completed by applying parallel deletion and inverse transposition of the
- UTF-16 bit streams\cite{Cameron2008}.
+ UTF-16 \bitstream{}s\cite{Cameron2008}.

  \begin{figure*}[tbh]
     
    CRLF = pablo.Advance(lex.CR) & lex.LF
    callouts.delmask |= CRLF
- # Adjust LF streams for newline/column tracker
+ # Adjust LF streams for line/column tracker
    lex.LF |= lex.CR
    lex.LF ^= CRLF
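The pablo fragment in this hunk executes naturally over arbitrary-width Python integers, since each `lex` stream is just a bitstream. An equivalent model, where `Advance` is a left shift (stream values in the usage below are illustrative):

```python
def advance(stream: int) -> int:
    """Model of pablo.Advance: move every marker one position forward."""
    return stream << 1

def normalize_line_breaks(cr: int, lf: int, delmask: int):
    """Bitstream CRLF normalization: the LF of each CR+LF pair is marked
    for deletion, and the LF stream is adjusted so that every line break
    (lone CR, lone LF, or CRLF) contributes exactly one mark for the
    line/column tracker."""
    crlf = advance(cr) & lf    # LF positions immediately after a CR
    delmask |= crlf            # delete the LF of each CRLF pair
    lf |= cr                   # lone CRs count as line breaks too
    lf ^= crlf                 # drop the deleted LF of each CRLF
    return lf, delmask
```

For the text `"a\r\nb\rc\nd"` (CR at bits 1 and 4, LF at bits 2 and 6), the CRLF's LF at bit 2 lands in the deletion mask and the adjusted LF stream marks exactly one position per line break (bits 1, 4, 6).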
     
  combined into the overall deletion mask.   After the
  deletion and inverse transposition operations are finally
- applied, a postprocessing step inserts the proper character
+ applied, a post-processing step inserts the proper character
  at these positions.   One note about this process is
  that it is speculative; references are assumed to generally
  be replaced by a single UTF-16 code unit.   In the case,
- that this is not true, it is addressed in postprocessing.
+ that this is not true, it is addressed in post-processing.
    102102The final step of combined filtering occurs during
  • docs/Working/icXML/performance.tex

r2871 r2872

  CPU cycles per byte for the SAXCount application.
  The speedup for \icXML{} over Xerces is 1.3x to 1.8x.
- With two threads on the multicore machine, our pipelined version can achieve speedup up to 2.7x.
+ With two threads on the multicore machine, \icXMLp{} can achieve speedup up to 2.7x.
  Xerces is substantially slowed by dense markup
  but \icXML{} is less affected through a reduction in branches and the use of parallel-processing techniques.
     
  fewer branches.  Figure \ref{branchmiss_GML2SVG} shows the corresponding
  improvement in branching behaviour, with a dramatic reduction in branch misses per kB.
- It is also interesting to note that pipelined \icXML{} goes even
- further.   In essence, in using pipeline parallelism to split the instruction
+ It is also interesting to note that \icXMLp{} goes even further.
+ In essence, in using pipeline parallelism to split the instruction
  stream onto separate cores, the branch target buffers on each core are
  less overloaded and able to increase the successful branch prediction rate.
     
  and data-cache performance with the improvements in instruction-cache
  behaviour the most dramatic.   Single-threaded \icXML{} shows substantially improved
- performance over Xerces on both measures.   The pipelined version shows a slight
- worsening in data-cache performance, well more than offset by a further dramatic
- reduction in instruction-cache miss rate.   Again partitioning the instruction
- stream through the pipeline parallelism model has significant benefit.
+ performance over Xerces on both measures.
+ Although \icXMLp{} is slightly worse \wrt{} data-cache performance,
+ this is more than offset by a further dramatic reduction in instruction-cache miss rate.
+ Again partitioning the instruction stream through the pipeline parallelism model has
+ significant benefit.

  \begin{figure}