Version 3 (modified by cameron, 11 years ago)

--

(Sept. 23, 2008 - RDC Design Notes)

PI_Comment_CDATA Filter

The PI_Comment_CDATA Filter is a key component of the Parabix multicore architecture, designed to both simplify and parallelize subsequent XML processing. In essence, this component identifies the location and extent of all processing instruction (PI), comment, and CDATA markup items and filters them out from subsequent processing. Once complete, it establishes the property that all remaining markup start characters ([<&]) must indeed represent markup start positions in any well-formed XML document. This property then enables the document to be broken into arbitrary segments, each of which may be independently processed for markup items. Furthermore, processing of subsequent markup items is simplified by eliminating the need to test for processing instructions, comments or CDATA sections at any remaining opening angle bracket ("<") locations.

In accord with the ArraySet model, the PI_Comment_CDATA Filter produces the following outputs.

  1. A bit stream comprising 1 bits for all and only positions within the document that are within a processing instruction, comment or CDATA section, including the opening and closing ("<" and ">") delimiters (PI_Comment_CDATA_stream).
  2. One counter each for the total number of processing instructions, comments and CDATA sections found within the document (Total_PI_Count, Total_Comment_Count, Total_CDATA_Count).
  3. An array of numeric processing instruction target IDs, one for each processing instruction found within the document.
  4. One array each for the content start positions of processing instructions, comments and CDATA sections found within the document, where the content start position of a processing instruction is the first position after the "<?", the target name and any following whitespace, the content start position of a comment is the first position after the "<!--" delimiter and the content start position of a CDATA section is the first position after the "<![CDATA[" delimiter.
  5. One array each for the content lengths of processing instructions, comments and CDATA sections, where the content length is the length of the text from the content start position up to, but not including, the closing delimiter.
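
As a rough illustration of the five outputs listed above, the following sketch groups them into one Python record. It is an assumption-laden sketch, not the Parabix data layout: the class name, the array field names, and the choice to hold the bit stream in a Python integer (bit i standing for document position i) are all hypothetical; only the stream and counter names come from the notes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PICommentCDATAOutputs:
    """Hypothetical container for the filter outputs (names are illustrative)."""

    # Output 1: bit i is 1 when position i lies inside a PI, comment or
    # CDATA section, delimiters included (assumed integer representation).
    pi_comment_cdata_stream: int = 0

    # Output 2: one counter per markup type.
    total_pi_count: int = 0
    total_comment_count: int = 0
    total_cdata_count: int = 0

    # Output 3: one numeric target ID per processing instruction.
    pi_target_ids: List[int] = field(default_factory=list)

    # Output 4: content start positions, one array per markup type.
    pi_content_starts: List[int] = field(default_factory=list)
    comment_content_starts: List[int] = field(default_factory=list)
    cdata_content_starts: List[int] = field(default_factory=list)

    # Output 5: content lengths, excluding the closing delimiter.
    pi_content_lengths: List[int] = field(default_factory=list)
    comment_content_lengths: List[int] = field(default_factory=list)
    cdata_content_lengths: List[int] = field(default_factory=list)
```

Keeping parallel arrays per markup type, rather than one array of records, follows the "one array each" wording of the list; entry k of each start array pairs with entry k of the matching length array.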

The PI_Comment_CDATA Filter requires that LexicalItemStream formation be complete.

Method

A bit stream is formed consisting of the logical AND of the [<] stream shifted forward one position and the [?!] stream (PI_Comment_CDATA_Start). This stream identifies all potential opening delimiters of processing instructions, comments and CDATA sections. Bit streams for potential closing delimiters are also formed for each of the markup types. In the case of processing instructions and comments, the simple single-character bit streams [?] and [-] are used to minimize the cost of bit stream computation. The full 3-character bit stream for CDATA end delimiters ("]]>") is computed, as it is also needed to prove the absence of this sequence in document content.
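
The stream formation described above can be sketched with Python integers standing in for bit streams, where bit i corresponds to document position i, so "shifted forward one position" becomes a left shift by one. The helper names (char_stream, positions, pi_comment_cdata_start, cdata_end_stream) are hypothetical, and the character-by-character loop only stands in for the parallel bit stream computation that Parabix would actually use.

```python
def char_stream(text: str, chars: str) -> int:
    """Return an integer whose bit i is 1 when text[i] is one of chars."""
    bits = 0
    for i, ch in enumerate(text):
        if ch in chars:
            bits |= 1 << i
    return bits


def positions(bits: int) -> list:
    """List the positions of the 1 bits, lowest first."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


def pi_comment_cdata_start(text: str) -> int:
    """Mark each '?' or '!' that immediately follows a '<'."""
    lt = char_stream(text, "<")
    qmark_bang = char_stream(text, "?!")
    # Shifting the [<] stream forward by one moves each marker onto the
    # character that follows the '<'.
    return (lt << 1) & qmark_bang


def cdata_end_stream(text: str) -> int:
    """Mark the '>' of every complete "]]>" sequence."""
    rbracket = char_stream(text, "]")
    gt = char_stream(text, ">")
    return (rbracket << 2) & (rbracket << 1) & gt


if __name__ == "__main__":
    doc = "<a><?pi x?><!-- c --><![CDATA[ y ]]>"
    print(positions(pi_comment_cdata_start(doc)))
    print(positions(cdata_end_stream(doc)))
```

Note that the "?" closing the processing instruction is not marked in the start stream, because it does not follow a "<"; it only appears in the single-character [?] candidate stream used later for end-delimiter scanning.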

A sequential scan is then made through the PI_Comment_CDATA_Start stream. Upon encountering a start position for one of the markup types, the markup classification is determined. This may be achieved either through a sequential test of the three types or through a numeric code ([0-2]) calculated using a suitable hash function. Depending on the markup type, scanning for the appropriate closing delimiter proceeds. The appropriate delimiter candidate bit stream is scanned until a position is found at which the delimiter is definitively proven. The appropriate output values and arrays are then updated and the corresponding span of PI_Comment_CDATA_stream is generated.
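
The scan described above might look roughly like the following sketch. It uses the sequential three-way test rather than a hash code, represents bit streams as Python integers (bit i for position i), and returns a plain dictionary; all function names and the dictionary layout are hypothetical. It omits target ID assignment and does not check other well-formedness rules, such as the ban on "--" inside comments.

```python
XML_WS = " \t\r\n"


def char_stream(text, chars):
    """Return an integer whose bit i is 1 when text[i] is one of chars."""
    bits = 0
    for i, ch in enumerate(text):
        if ch in chars:
            bits |= 1 << i
    return bits


def scan_forward(bits, start):
    """Return the lowest set bit position >= start, or -1 if none."""
    rest = bits >> start
    if rest == 0:
        return -1
    return start + (rest & -rest).bit_length() - 1


def scan_until(candidates, text, start, closer):
    """Walk candidate positions until the full closer is proven there."""
    cand = scan_forward(candidates, start)
    while cand != -1 and not text.startswith(closer, cand):
        cand = scan_forward(candidates, cand + 1)
    return cand


def filter_pi_comment_cdata(text):
    rbracket = char_stream(text, "]")
    starts = (char_stream(text, "<") << 1) & char_stream(text, "?!")
    qmark = char_stream(text, "?")
    hyphen = char_stream(text, "-")
    cdata_end = (rbracket << 2) & (rbracket << 1) & char_stream(text, ">")
    kinds = ("pi", "comment", "cdata")
    out = {"mask": 0,
           "count": {k: 0 for k in kinds},
           "start": {k: [] for k in kinds},
           "length": {k: [] for k in kinds}}
    pos = scan_forward(starts, 0)
    while pos != -1:
        if text[pos] == "?":
            kind = "pi"
            end = scan_until(qmark, text, pos + 1, "?>")
            close_pos = end + 1
            content = pos + 1
            # Skip the target name, then any whitespace after it.
            while content < end and text[content] not in XML_WS:
                content += 1
            while content < end and text[content] in XML_WS:
                content += 1
        elif text.startswith("--", pos + 1):
            kind = "comment"
            content = pos + 3
            end = scan_until(hyphen, text, content, "-->")
            close_pos = end + 2
        elif text.startswith("[CDATA[", pos + 1):
            kind = "cdata"
            content = pos + 8
            # The "]]>" stream marks the final '>' and needs no further proof.
            close_pos = scan_forward(cdata_end, content + 2)
            end = close_pos - 2
        else:
            # Some other "<!" construct (for example a DOCTYPE); not filtered.
            pos = scan_forward(starts, pos + 1)
            continue
        if end < 0:
            raise ValueError("unterminated %s opened at %d" % (kind, pos - 1))
        out["count"][kind] += 1
        out["start"][kind].append(content)
        out["length"][kind].append(end - content)
        # Set mask bits from the opening '<' through the closing '>'.
        open_pos = pos - 1
        out["mask"] |= ((1 << (close_pos - open_pos + 1)) - 1) << open_pos
        # Resume after the item, skipping any start candidates inside it.
        pos = scan_forward(starts, close_pos + 1)
    return out
```

Resuming the scan just past the closing delimiter is what keeps text such as "<?" inside a comment from being treated as markup; once the mask is applied, every remaining "<" is a genuine markup start in a well-formed document.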