% Source: docs/Working/icXML/arch-contentstream.tex @ 2531 (checked in by nmedfort)

\subsection{Content Stream}

A relatively unique concept for \icXML{} is the use of a filtered content stream.
Rather than parsing an XML document in its original format, the input is transformed
into one that is easier for the parser to iterate through and produce the sequential
output.
In Figure~\ref{fig:parabix2}, the source data
% \verb|<root><t1>text</t1><t2 a1=’foo’ a2 = ’fie’>more</t2><tag3 att3=’b’/></root>|
is transformed into
``{\tt\it 0}\verb`>fee`{\tt\it 0}\verb`=fie`{\tt\it 0}\verb`=foe`{\tt\it 0}\verb`>`{\tt\it 0}\verb`/fum`{\tt\it 0}\verb`/`''
through the parallel filtering algorithm, described in Section~\ref{sec:parfilter}.

The parser traverses the content stream together with the symbol stream to effectively
reconstruct the input document in its output form.
The initial {\tt\it 0} indicates an empty content string. The following \verb|>|
indicates that a start tag without any attributes is the first element in this text and
that the first unused symbol, ``\verb|document|'', is the element name.
Succeeding that is the content string ``\verb`fee`'', which is null-terminated in accordance
with the Xerces API specification. Unlike Xerces, no memory-copy operations
are required to produce these strings; as Figure~\ref{fig:xerces-profile} shows, such operations
account for $6.83\%$ of Xerces's execution time.
Additionally, it is cheap to locate the terminal character of each string:
using the String End bit stream, the \PS{} can effectively calculate the offset of each
null character in the content stream in parallel, which in turn means the parser can
jump directly to the end of every string without scanning for it.
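
As a rough illustration of the offset calculation described above, the following Python sketch
models a String End bit stream as an integer bit mask in which bit $i$ is set when byte $i$ of the
content stream is a null terminator. The function names \verb`build_string_end_mask` and
\verb`string_end_offsets` and the mask layout are assumptions made for this example only; they are
not taken from \icXML{}, and the loop below is sequential rather than parallel.

```python
def build_string_end_mask(content: bytes) -> int:
    """Build a hypothetical String End mask: bit i is set when content[i] is NUL."""
    mask = 0
    for i, byte in enumerate(content):
        if byte == 0:
            mask |= 1 << i
    return mask


def string_end_offsets(mask: int) -> list:
    """Return the offsets of all set bits in the mask, lowest offset first."""
    offsets = []
    while mask:
        lowest = mask & -mask                    # isolate the lowest set bit
        offsets.append(lowest.bit_length() - 1)  # convert that bit to its index
        mask &= mask - 1                         # clear the bit just reported
    return offsets


# The content stream from the running example, with NUL written as \x00.
content = b"\x00>fee\x00=fie\x00=foe\x00>\x00/fum\x00/"
offsets = string_end_offsets(build_string_end_mask(content))
```

With the offsets available up front, a consumer can slice each string out of the content stream
without searching for its terminator.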

Following ``\verb`fee`'' is a \verb`=`, which marks the existence of an attribute.
Because all of the intra-element validation was performed in the \PS{}, this must be a legal attribute.
Since attributes can only occur within start tags and must be accompanied by a textual value,
the next symbol in the symbol stream must be the element name of a start tag,
the following one must be the name of the attribute, and the string that follows the \verb`=` must be its value.
However, the subsequent \verb`=` is not treated as the start of a new element because the parser has yet to
read a \verb`>`, which marks the end of a start tag. Thus only one symbol, the attribute name, is taken from the symbol stream, and
it (along with the string value) is added to the element.
Eventually the parser reaches a \verb`/`, which marks the existence of an end tag. Every end tag requires an
element name, which means it requires a symbol. Inter-element validation is performed whenever an empty tag is detected to
ensure that the appropriate scope-nesting rules have been applied.
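
The traversal described in this subsection can be sketched as a small state machine. The Python
code below is only an illustration under stated assumptions: the function name
\verb`walk_content_stream`, the event tuples, and every symbol other than ``\verb|document|'' are
hypothetical, and the sketch locates each null terminator by searching, whereas the text above
describes jumping to precomputed offsets. It performs no inter-element validation.

```python
def walk_content_stream(content: bytes, symbols: list) -> list:
    """Pair each NUL-terminated string and its marker with symbols (illustrative only)."""
    events = []
    symbol_iter = iter(symbols)
    pos = 0
    element = None    # name of a start tag that is still collecting attributes
    attr_name = None  # name of the attribute whose value is read next
    attrs = {}
    while pos < len(content):
        end = content.index(0, pos)         # sketch: search for the NUL terminator
        text = content[pos:end].decode("utf-8")
        marker = chr(content[end + 1])      # the marker follows the terminator
        pos = end + 2
        if element is None:
            if text:
                events.append(("text", text))
            if marker == ">":               # start tag without attributes
                events.append(("start", next(symbol_iter), {}))
            elif marker == "=":             # start tag with at least one attribute
                element = next(symbol_iter)
                attr_name = next(symbol_iter)
                attrs = {}
            elif marker == "/":             # end tag
                events.append(("end", next(symbol_iter)))
            else:
                raise ValueError("unexpected marker " + repr(marker))
        else:
            attrs[attr_name] = text         # the string after '=' is the value
            if marker == "=":               # another attribute: one more symbol
                attr_name = next(symbol_iter)
            elif marker == ">":             # end of the start tag
                events.append(("start", element, attrs))
                element = None
            else:
                raise ValueError("unexpected marker " + repr(marker))
    return events


# Running example; all symbols except "document" are made up for illustration.
content = b"\x00>fee\x00=fie\x00=foe\x00>\x00/fum\x00/"
symbols = ["document", "a", "x", "y", "a", "document"]
events = walk_content_stream(content, symbols)
```

Under these assumed symbols, the first \verb`=` consumes two symbols (element name and attribute
name) while the second consumes only one, matching the behaviour described above.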