Version 1 (modified by cameron, 11 years ago)


PI_Comment_CDATA Filter

The PI_Comment_CDATA Filter is a fast linear pass through the document that identifies the location and extent of all processing instruction (PI), comment, and CDATA markup items. The following results are produced in accordance with the ArraySet model.

  1. A bit stream comprising 1 bits for all and only positions within the document that are within a processing instruction, comment or CDATA section, including the opening "<" and closing ">" delimiters (PI_Comment_CDATA_stream).
  2. One counter each for the total number of processing instructions, comments and CDATA sections found within the document (Total_PI_Count, Total_Comment_Count, Total_CDATA_Count).
  3. One array each for the start positions of processing instructions, comments and CDATA sections found within the document.
  4. One array each for the lengths of processing instructions, comments and CDATA sections found within the document.
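
The four results listed above can be pictured as a single record. The following is a minimal sketch only: the class names `MarkupItems` and `PICommentCDATAResult`, the `record` helper, and the use of a Python integer as the bit stream (bit i standing for document position i) are assumptions made for illustration, not part of the ArraySet model itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MarkupItems:
    """Start positions, lengths and total count for one markup type (hypothetical layout)."""
    starts: List[int] = field(default_factory=list)
    lengths: List[int] = field(default_factory=list)
    total_count: int = 0


@dataclass
class PICommentCDATAResult:
    """Outputs of the filter; bit i of the stream corresponds to document position i."""
    pi_comment_cdata_stream: int = 0
    pi: MarkupItems = field(default_factory=MarkupItems)
    comment: MarkupItems = field(default_factory=MarkupItems)
    cdata: MarkupItems = field(default_factory=MarkupItems)

    def record(self, items: MarkupItems, start: int, length: int) -> None:
        """Register one markup item and set its positions in the bit stream."""
        items.starts.append(start)
        items.lengths.append(length)
        items.total_count += 1
        # Set `length` consecutive bits beginning at bit `start`.
        self.pi_comment_cdata_stream |= ((1 << length) - 1) << start
```

For example, `result.record(result.pi, 3, 8)` would register a processing instruction occupying positions 3 through 10.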

This pass has several expected benefits in simplifying subsequent processing and enabling further optimizations.

  1. Once complete, parsing of remaining markup (tags and references) can proceed in a data parallel fashion; all remaining occurrences of markup start characters ([<&]) must be the actual start of markup, not data characters.
  2. Subsequent processing of remaining markup is simplified. No logic for any of these three types of markup is required in subsequent parsing routines. This simplifies the subsequent programming tasks and may enable optimizations that would not otherwise be available.
  3. The cost of testing for alternative markup types at markup positions is reduced. No tests for any of these markup types are required at opening angle bracket positions that are not immediately followed by "!" or "?".


A bit stream (PI_Comment_CDATA_Start) is formed as the logical AND of the [<] stream shifted forward one position and the [?!] stream. This stream identifies all potential opening delimiters of processing instructions, comments and CDATA sections. Bit streams for potential closing delimiters are also formed for each of the markup types. In the case of processing instructions and comments, the simple single-character bit streams [?] and [-] are used to minimize the cost of bit stream computation. The full 3-character bit stream for CDATA end delimiters ("]]>") is computed, as it is also needed to prove the absence of this sequence in document content.
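
The stream construction described above can be sketched as follows. This is an illustration under stated assumptions rather than the actual implementation: bit streams are modelled as Python integers with bit i standing for document position i (so "shifted forward one position" becomes a left shift), and the helper names `char_stream`, `pi_comment_cdata_start`, `cdata_end` and `positions` are hypothetical.

```python
from typing import List


def char_stream(text: str, chars: str) -> int:
    """Return an integer whose bit i is 1 iff text[i] is one of `chars`."""
    bits = 0
    for i, ch in enumerate(text):
        if ch in chars:
            bits |= 1 << i
    return bits


def pi_comment_cdata_start(text: str) -> int:
    """Mark every "?" or "!" that immediately follows a "<"."""
    lt = char_stream(text, "<")
    qmark_bang = char_stream(text, "?!")
    # Move each "<" bit to the next position, then keep only [?!] positions.
    return (lt << 1) & qmark_bang


def cdata_end(text: str) -> int:
    """Mark the ">" of every full "]]>" sequence."""
    rb = char_stream(text, "]")
    gt = char_stream(text, ">")
    return (rb << 2) & (rb << 1) & gt


def positions(bits: int) -> List[int]:
    """List the indices of the 1 bits, lowest first."""
    out, i = [], 0
    while bits:
        if bits & 1:
            out.append(i)
        bits >>= 1
        i += 1
    return out
```

Note that in this sketch each marked start position is the "?" or "!" character, one position after the "<" that opens the markup item.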

A sequential scan is then made through the PI_Comment_CDATA_Start stream. Upon encountering a start position for one of the markup types, the markup classification is determined. This may be achieved either through a sequential test of the three types or through a numeric code ([0-2]) calculated using a suitable hash function. Depending on the markup type, scanning for the appropriate closing delimiter proceeds. The appropriate delimiter candidate bit stream is scanned until a position is found at which the full closing delimiter is confirmed. The appropriate output counters and arrays are then updated and the corresponding portion of PI_Comment_CDATA_stream is generated.
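
The scan can be sketched at the character level as follows. This is a simplified illustration, not the bit-parallel implementation: it classifies by a sequential test of the three types, walks single-character candidates ("?", "-" or "]") until the full closing delimiter is confirmed, and represents the output bit stream as a list of 0/1 values. The function name `scan_pi_comment_cdata` and the dictionary layout are assumptions. Start candidates that match none of the three types (for example "<!DOCTYPE") are skipped.

```python
from typing import Dict, List, Tuple


def scan_pi_comment_cdata(text: str) -> Tuple[List[int], Dict[str, List[int]], Dict[str, List[int]]]:
    n = len(text)
    mask = [0] * n  # 1 for every position inside a PI, comment or CDATA section
    starts: Dict[str, List[int]] = {"PI": [], "Comment": [], "CDATA": []}
    lengths: Dict[str, List[int]] = {"PI": [], "Comment": [], "CDATA": []}
    pos = 0
    while True:
        # Advance to the next start candidate: "<" followed by "?" or "!".
        while pos + 1 < n and not (text[pos] == "<" and text[pos + 1] in "?!"):
            pos += 1
        if pos + 1 >= n:
            break
        # Classify by a sequential test of the three markup types.
        if text[pos + 1] == "?":
            kind, cand, closer, body = "PI", "?", "?>", pos + 2
        elif text.startswith("<!--", pos):
            kind, cand, closer, body = "Comment", "-", "-->", pos + 4
        elif text.startswith("<![CDATA[", pos):
            kind, cand, closer, body = "CDATA", "]", "]]>", pos + 9
        else:
            pos += 1  # not one of the three types; leave for later parsing
            continue
        # Walk the single-character candidates until the closer is confirmed.
        end = -1
        i = text.find(cand, body)
        while i != -1:
            if text.startswith(closer, i):
                end = i + len(closer)
                break
            i = text.find(cand, i + 1)
        if end == -1:
            raise ValueError(f"unterminated {kind} starting at position {pos}")
        starts[kind].append(pos)
        lengths[kind].append(end - pos)
        mask[pos:end] = [1] * (end - pos)
        pos = end
    return mask, starts, lengths
```

Because scanning resumes at the end of each confirmed item, a "<" or "&" inside a comment or CDATA section is never treated as a markup start, which is the property the later parsing stages rely on.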