Changes between Version 1 and Version 2 of PI_Comment_CDATA

Sep 23, 2008, 11:03:57 AM



  • PI_Comment_CDATA

    v1 v2  
     1(Sept. 23, 2008 - RDC Design Notes)
    13== PI_Comment_CDATA Filter ==
    3 The PI_Comment_CDATA Filter is a fast linear pass through the document
    4 to identify the location and extent of all processing instruction (PI), comment,
    5 and CDATA markup items.   The following results are produced in accord with
    6 the ArraySet model.
     5The PI_Comment_CDATA Filter is a key component of the Parabix multicore architecture, designed to both simplify and parallelize subsequent XML processing.
     6In essence, this component identifies the location and extent of all processing instruction (PI), comment, and CDATA markup items and filters
     7them out from subsequent processing.   Once complete, it establishes the property that all remaining markup start characters ([<&]) must indeed represent markup
     8start positions in any well-formed XML document.   This property then enables the document to be broken into arbitrary segments that each may be
     9independently processed for markup items.  Furthermore, processing of subsequent markup items is simplified by eliminating the need to test
     10for processing instructions, comments or CDATA sections at any remaining opening angle bracket ("<") locations.
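
The property described above can be illustrated with a minimal Python sketch. It is not the Parabix implementation: the regular expression stands in for the filter pass, and the names `filter_mask`, `markup_starts` and `segmented_markup_starts` are hypothetical, introduced only for this example. The sketch assumes a well-formed document and shows that, once filtered positions are masked out, each segment can be scanned for markup start characters without any knowledge of the other segments.

```python
import re

# Assumption: a regex alternation stands in for the filter pass. finditer
# scans left to right and yields non-overlapping matches, so a "<?" inside a
# comment or a "<!--" inside a CDATA section is consumed by the enclosing item.
ITEM = re.compile(r"<\?.*?\?>|<!--.*?-->|<!\[CDATA\[.*?\]\]>", re.DOTALL)


def filter_mask(doc):
    """Return a 0/1 list marking positions inside a PI, comment or CDATA section."""
    mask = [0] * len(doc)
    for match in ITEM.finditer(doc):
        for i in range(match.start(), match.end()):
            mask[i] = 1
    return mask


def markup_starts(doc, mask, lo, hi):
    """Positions in [lo, hi) holding an unmasked '<' or '&'."""
    return [i for i in range(lo, hi) if doc[i] in "<&" and not mask[i]]


def segmented_markup_starts(doc, mask, seg_len):
    """Scan fixed-size segments one by one; no segment depends on another."""
    found = []
    for lo in range(0, len(doc), seg_len):
        hi = min(lo + seg_len, len(doc))
        found.extend(markup_starts(doc, mask, lo, hi))
    return found


doc = 'a<!-- <b> & -->c<d>&amp;<![CDATA[<x>&]]><?p <y?></d>'
mask = filter_mask(doc)
whole = markup_starts(doc, mask, 0, len(doc))
# Any segment length gives the same positions as a scan of the whole document.
assert all(segmented_markup_starts(doc, mask, n) == whole for n in (1, 3, 7, 16))
```

The per-segment loop runs sequentially here; because each call to `markup_starts` reads only its own slice of the document and mask, the calls could in principle be handed to separate workers.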
     12In accord with the ArraySet model, the PI_Comment_CDATA Filter produces the following outputs.
    813 1. A bit stream comprising 1 bits for all and only positions within the document that are within a processing instruction, comment or CDATA section, including the opening "<" and closing ">" delimiter characters (PI_Comment_CDATA_stream).
    914 2. One counter each for the total number of processing instructions, comments and CDATA sections found within the document (Total_PI_Count, Total_Comment_Count, Total_CDATA_Count).
    1116 4. One array each for the lengths of processing instructions, comments and CDATA sections found within the document.
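
As a rough illustration of the outputs listed above, the following Python sketch computes them with a plain sequential scan. It is a sketch under stated assumptions, not the Parabix method: the function name `pi_comment_cdata_filter`, the `FilterOutputs` container and the field names are hypothetical, lengths are assumed to be measured in characters including delimiters, and only the outputs visible in this list (the bit stream, the counters and the length arrays) are modelled.

```python
from dataclasses import dataclass

# Opening and closing delimiters for each kind of item.
DELIMS = {
    "PI": ("<?", "?>"),
    "Comment": ("<!--", "-->"),
    "CDATA": ("<![CDATA[", "]]>"),
}


@dataclass
class FilterOutputs:
    stream: list   # stands in for PI_Comment_CDATA_stream (one 0/1 per position)
    counts: dict   # stands in for Total_PI_Count, Total_Comment_Count, Total_CDATA_Count
    lengths: dict  # one list of item lengths per kind (assumed to include delimiters)


def pi_comment_cdata_filter(doc):
    stream = [0] * len(doc)
    counts = {kind: 0 for kind in DELIMS}
    lengths = {kind: [] for kind in DELIMS}
    pos = doc.find("<")
    while pos >= 0:
        for kind, (opener, closer) in DELIMS.items():
            if doc.startswith(opener, pos):
                # Search for the closer only after the full opener.
                close_at = doc.find(closer, pos + len(opener))
                if close_at < 0:
                    raise ValueError(f"unterminated {kind} at position {pos}")
                end = close_at + len(closer)
                stream[pos:end] = [1] * (end - pos)
                counts[kind] += 1
                lengths[kind].append(end - pos)
                next_from = end      # skip the whole item
                break
        else:
            next_from = pos + 1      # an ordinary "<": leave it for later passes
        pos = doc.find("<", next_from)
    return FilterOutputs(stream, counts, lengths)


out = pi_comment_cdata_filter("x<?pi a?><!--c--><![CDATA[<&]]><t/>")
assert out.counts == {"PI": 1, "Comment": 1, "CDATA": 1}
```

Skipping directly to the end of each item is what keeps a "<" inside a comment or CDATA section from being examined at all, which matches the intent that only genuine markup starts remain visible afterwards.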
    13 This pass has several expected benefits in simplifying subsequent processing and enabling further
    14 optimizations.
    15  1. Once complete, parsing of remaining markup (tags and references) can proceed in a data parallel fashion; all remaining occurrences of markup start characters ([<&]) must be the actual start of markup, not data characters.
    16  2. Subsequent processing of remaining markup is simplified.  No logic for any of these three types of markup is required in subsequent parsing routines.  This simplifies the subsequent programming tasks and may enable optimizations that would not otherwise be available.
    17  3. The cost of testing for alternative markup types at markup positions is reduced.  No tests for any of these markup types are required at opening angle bracket positions that are not immediately followed by "!" or "?".
     18The PI_Comment_CDATA Filter requires that LexicalItemStream formation be complete.
    1920=== Method ===