XML String Value Extraction

XML String Value Extraction is the process of extracting text strings for the values of attributes and the textual content of elements. It involves applying all the rules of XML with respect to string values, including character entity and general entity expansion, line-break normalization, CDATA section parsing and attribute value normalization. It may also involve transcoding between the document character set (DCS) in which an XML document is encoded and the working character set (WCS) that the application program uses for string processing.

Efficient string value extraction is a performance-critical process in XML parsing. The various XML rules for entity expansion and string normalization, as well as transcoding issues, can require expensive byte-by-byte processing. However, Parabix is designed to eliminate the cost of this processing in the vast majority of cases.

The Parabix Design for String Value Extraction

Plaintext Prefixes

The Parabix design for high-performance string value extraction is based on the concept of maximal plaintext prefixes. Here, the term plaintext is borrowed from cryptography to refer to text strings that require no decoding. For example, given the XML element "<p>ordinary&lt;</p>", the maximal plaintext prefix of the content of the "p" element is "ordinary". The string "&lt;" requires decoding as an entity reference and is hence not plaintext.

The exact definition of plaintext depends on the XML and character set context. There are four different XML contexts for string value extraction: extraction of an attribute value in "CDATA" mode (applying the CDATA rules for white space normalization), extraction of an attribute value in "NMTOKENS" mode (applying additional normalizations), extraction of text in element content, and extraction of text in a CDATA section. For example, "&lt;" occurring within a CDATA section is not an entity reference, so the maximal plaintext prefix of the contents of the CDATA section "<![CDATA[ordinary&lt;]]>" is "ordinary&lt;", the full contents.
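
The dependence on context can be illustrated with a small sketch. The per-context character sets below are assumptions chosen for illustration (this page does not list the exact sets Parabix uses), and the function name is hypothetical. The loop is deliberately character-at-a-time; it only defines what a maximal plaintext prefix is, not how to find one quickly.

```python
# Hypothetical sets of characters that end a plaintext run in each
# XML context. These are illustrative assumptions, not Parabix's sets.
NONPLAIN = {
    "ElemText":  set("&<\r"),        # references, markup, line breaks
    "CDATAtext": set("\r"),          # only line breaks need normalizing
    "AttCDATA":  set("&<\t\n\r"),    # plus whitespace normalized to space
    "AttToken":  set("&<\t\n\r "),   # plus spaces, which may be collapsed
}

def maximal_plaintext_prefix(text, context):
    """Return the longest prefix of text that needs no decoding."""
    nonplain = NONPLAIN[context]
    for i, ch in enumerate(text):
        if ch in nonplain:
            return text[:i]
    return text

print(maximal_plaintext_prefix("ordinary&lt;", "ElemText"))   # ordinary
print(maximal_plaintext_prefix("ordinary&lt;", "CDATAtext"))  # ordinary&lt;
```

The same input yields different prefixes in the two contexts, matching the examples in the text.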

Character set contexts refer to various combinations of document character set families (ASCII/UTF-8 family, UTF-16 family, UTF-32 family, extended-ASCII family, EBCDIC family) and the supported working character sets (UTF-8, UTF-16, UTF-32). For example, for the ASCII/UTF-8 family as document character set and the UTF-16 working character set, simple sequences of ASCII values can be easily transcoded to UTF-16 by inserting null bytes, while multibyte UTF-8 sequences require decoding to produce the corresponding UTF-16 values. Thus, in this context, multibyte UTF-8 sequences are excluded from the definition of plaintext. On the other hand, if the document and working character sets are both UTF-8, then no transcoding is required. In this context, multibyte sequences may then be included in the definition of plaintext.
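
As a rough sketch of the character set side alone, the following shows ASCII bytes being widened to UTF-16 by pairing each with a zero byte, and how the plaintext length of a UTF-8 byte string differs by working character set. Both function names are hypothetical, little-endian byte order is an assumption, and the XML-level nonplain characters are ignored here.

```python
def ascii_to_utf16le(data: bytes) -> bytes:
    """Widen pure-ASCII bytes to UTF-16LE by adding a zero high byte."""
    if any(b >= 0x80 for b in data):
        raise ValueError("non-ASCII byte: not plaintext in this context")
    out = bytearray(2 * len(data))   # all bytes start as zero
    out[0::2] = data                 # low-order byte of each code unit
    return bytes(out)

def charset_plaintext_length(data: bytes, wcs: str) -> int:
    """Length of the leading run needing no decoding.

    Assumes a UTF-8 family document character set and considers
    only transcoding, not XML markup or entity rules.
    """
    if wcs == "UTF_8":
        return len(data)             # no transcoding required
    for i, b in enumerate(data):
        if b >= 0x80:                # start of a multibyte sequence
            return i
    return len(data)

sample = "caf\u00e9".encode("utf-8")              # b'caf\xc3\xa9'
print(ascii_to_utf16le(b"abc"))                   # b'a\x00b\x00c\x00'
print(charset_plaintext_length(sample, "UTF_16")) # 3
print(charset_plaintext_length(sample, "UTF_8"))  # 5
```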

Nonplain Bitstreams

Based on this concept of plaintext prefixes, Parabix achieves high-performance string value extraction through the construction and use of nonplain bitstreams. A nonplain bitstream marks with a 0 bit each byte position that may be considered plaintext, while each position not classified as plaintext is marked with a 1 bit. Given an XML document fragment for string value extraction and the corresponding nonplain bitstream for this fragment, the maximal plaintext prefix is then easily found by scanning to the position of the first 1 bit in the nonplain bitstream. This immediately gives the length of the maximal plaintext prefix, allowing it to be quickly extracted without byte-at-a-time processing.
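
A minimal sketch of the data layout and the scan follows, using a Python integer as the bitstream with bit i standing for byte position i. The function names and the set of nonplain bytes are assumptions. The construction is a plain loop for illustration only; how Parabix actually builds the stream is outside what this section describes.

```python
def build_nonplain(data: bytes, nonplain_bytes: bytes) -> int:
    """Return an integer whose bit i is 1 when data[i] is nonplain."""
    stream = 0
    for i, b in enumerate(data):
        if b in nonplain_bytes:
            stream |= 1 << i
    return stream

def scan_to_first_one(stream: int, start: int, limit: int) -> int:
    """Position of the first 1 bit at or after start, or limit if none."""
    rest = stream >> start
    if rest == 0:
        return limit
    lowest = rest & -rest            # isolate the lowest set bit
    return min(start + lowest.bit_length() - 1, limit)

data = b"ordinary&lt;"
stream = build_nonplain(data, b"&<\r")   # assumed nonplain bytes
length = scan_to_first_one(stream, 0, len(data))
print(length, data[:length])             # 8 b'ordinary'
```

Once the bitstream exists, finding the prefix length is a single bit operation on the stream rather than a comparison per byte, which is the point of the design.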

Whenever the full text of a string to be extracted is plaintext, the extraction process will be very fast, completely bypassing any rules for normalization, expansion or complex transcoding. In many XML documents, perhaps even most, 100% of the required extractions will consist entirely of plaintext. For these documents, every extraction requires only a single bitscan and a substring selection. In other documents, there will be occurrences of nonplain text data, but these occurrences will often be quite infrequent.

Structure of String Value Extraction Routines

Parabix defines separate string value extraction routines for each combination of extraction context and character set context. However, each of these routines has a similar structure as reflected by the following pseudocode.

plaintext_length = ScanTo(nonplain);  // length of the leading plaintext
emit plaintext;
processed_length = plaintext_length;
while (processed_length < full_text_length) {
  examine one or more bytes at the first unprocessed position and
  process according to special case rules for normalization, expansion,
  or transcoding;
  extend processed_length to include the examined bytes;
  plaintext_length = ScanTo(nonplain);  // find more plaintext
  emit plaintext;
  processed_length += plaintext_length;
}
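
For one concrete instance of this structure, here is a hedged sketch for element text with UTF-8 as both document and working character set. All names are hypothetical. A regular expression search stands in for the bitstream scan, and the special cases are limited to line-break normalization, character references, and the five predefined entities; other general entities are not handled.

```python
import re

PREDEFINED = {"lt": "<", "gt": ">", "amp": "&", "apos": "'", "quot": '"'}
NONPLAIN = re.compile(r"[&\r]")      # assumed nonplain characters

def scan_to_nonplain(text, start):
    """Stand-in for the bitscan: next nonplain position, or end of text."""
    m = NONPLAIN.search(text, start)
    return m.start() if m else len(text)

def extract_elem_text(text):
    out = []
    pos = scan_to_nonplain(text, 0)
    out.append(text[:pos])                    # emit leading plaintext
    while pos < len(text):
        if text[pos] == "\r":                 # line-break normalization
            out.append("\n")
            pos += 2 if text.startswith("\r\n", pos) else 1
        else:                                 # '&': reference expansion
            end = text.index(";", pos)        # ValueError if unterminated
            name = text[pos + 1:end]
            if name.startswith("#x"):
                out.append(chr(int(name[2:], 16)))
            elif name.startswith("#"):
                out.append(chr(int(name[1:])))
            else:
                out.append(PREDEFINED[name])  # KeyError for other entities
            pos = end + 1
        nxt = scan_to_nonplain(text, pos)     # find more plaintext
        out.append(text[pos:nxt])             # emit it
        pos = nxt
    return "".join(out)

print(extract_elem_text("ordinary&lt;"))      # ordinary<
```

When the input is entirely plaintext, the loop body never runs and the routine reduces to one scan and one substring copy.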

The combinations for the different extraction routines may be modeled as (TextContext, DCS, WCS) triples. Here, TextContext may take on one of the four values AttCDATA for attribute extraction with CDATA whitespace normalization, AttToken for attribute extraction with token-based whitespace normalization, ElemText for text content in elements and CDATAtext for text content within CDATA sections. The DCS itself may be modeled as a pair (CharSet_Family, CharSet_Name) identifying the document character set within one of several families (UTF8_Family, U16_Family, U32_Family, Extended_ASCII_Family or EBCDIC_Family). The CharSet_Name is an IANA standard name for a specific character set within the family. In the Parabix design, the WCS parameter is simpler, taking on one of three possible values (UTF_8, UTF_16, UTF_32), for the three working character sets supported.
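
The triple model can be written down directly as types. This is only a sketch of the model described above: the member names come from the text, while the class names DCS, WCS and ExtractionKey and the field names are hypothetical.

```python
from enum import Enum
from typing import NamedTuple

class TextContext(Enum):
    AttCDATA = 1      # attribute value, CDATA whitespace normalization
    AttToken = 2      # attribute value, token-based normalization
    ElemText = 3      # text content in elements
    CDATAtext = 4     # text content within CDATA sections

class CharSetFamily(Enum):
    UTF8_Family = 1
    U16_Family = 2
    U32_Family = 3
    Extended_ASCII_Family = 4
    EBCDIC_Family = 5

class WCS(Enum):
    UTF_8 = 1
    UTF_16 = 2
    UTF_32 = 3

class DCS(NamedTuple):
    family: CharSetFamily
    name: str         # IANA character set name, e.g. "ISO-8859-1"

class ExtractionKey(NamedTuple):
    context: TextContext
    dcs: DCS
    wcs: WCS

key = ExtractionKey(TextContext.ElemText,
                    DCS(CharSetFamily.Extended_ASCII_Family, "ISO-8859-1"),
                    WCS.UTF_16)

# Counting combinations at the family level (ignoring specific names).
family_level = [(c, f, w) for c in TextContext
                for f in CharSetFamily for w in WCS]
print(len(family_level))   # 60
```

Whether routines are selected per family or per specific character set name is not stated in the text; the family-level count is shown only to give a sense of scale.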

Attribute and String Value Buffers

Parabix defines two linear buffers for receiving the results of the text extraction process. The AttTextBuffer consists of a concatenated sequence of extracted attribute values, while the StringValueBuffer consists of a similar concatenated sequence of text node values from the content of elements. The StringValueBuffer should consist of all text items in document order, so that the XPath/XSLT/XQuery StringValue of any node is a substring of this buffer.
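
The substring property can be sketched as follows. The buffer name comes from the text, but the class interface, the span representation and the sample document are assumptions made for illustration.

```python
class StringValueBuffer:
    """Linear buffer of text node values in document order (sketch)."""

    def __init__(self):
        self._chunks = []
        self._length = 0

    def append(self, text):
        """Add one text node value; return its (start, end) span."""
        start = self._length
        self._chunks.append(text)
        self._length += len(text)
        return (start, self._length)

    def value(self, start, end):
        """String value covering the given span of the buffer."""
        return "".join(self._chunks)[start:end]

# Text nodes of <a>one<b>two</b>three</a>, in document order.
buf = StringValueBuffer()
one = buf.append("one")
two = buf.append("two")
three = buf.append("three")

print(buf.value(*two))               # two          (string value of b)
print(buf.value(one[0], three[1]))   # onetwothree  (string value of a)
```

Because text items are stored in document order, the string value of an element is the span from the start of its first descendant text node to the end of its last, so no per-node concatenation is needed.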

Last modified on Jan 1, 2010, 12:48:44 PM