Version 1 (modified by cameron, 10 years ago)


Parabix ArraySet Model

Introduction and Rationale

The Parabix ArraySet Model is an array-oriented model for representing information extracted from XML documents, including information satisfying the full InfoSet requirements. It may be contrasted with a more traditional object-oriented model in which an XML document is directly represented as a tree of nodes.

The ArraySet Model represents information using a number of arrays, each of which holds information of a particular kind extracted from the document. The array elements are generally simple numeric or character values, stored in document order. For example, the CT_pos array holds the document positions at which XML comments (enclosed in <!-- and -->) occur.
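As a rough illustration of what such an array might contain, the sketch below builds a CT_pos-like array from a document string. It assumes that each entry is the offset of the opening <!-- of a comment, which the text above does not specify. The function name comment_positions is hypothetical, and the scan is deliberately naive: it does not handle comment-like text inside CDATA sections or processing instructions.

```python
from array import array


def comment_positions(doc: str) -> array:
    """Return the offsets of each comment opener `<!--`, in document order.

    Assumption: a CT_pos entry is the offset of the comment's opening
    delimiter. This is a naive scan, not a conforming XML parser.
    """
    positions = array("I")  # compact array of unsigned integers
    i = doc.find("<!--")
    while i != -1:
        positions.append(i)
        end = doc.find("-->", i + 4)  # look for the matching closer
        if end == -1:
            break  # unterminated comment: stop scanning
        i = doc.find("<!--", end + 3)
    return positions


ct_pos = comment_positions("<a><!-- x --><b/><!-- y --></a>")
print(list(ct_pos))
```

The result is a flat array of integers rather than a list of comment node objects, which is the contrast with the object-oriented model drawn in the introduction.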

The primary purpose of the ArraySet model is to support high-performance XML processing that makes good use of the software and hardware resources typically available in commodity processing environments. The following considerations motivate an array-oriented design.

  1. Prefetching. Commodity processors commonly support hardware and/or software prefetching to ensure that data is available in a processor cache when it is needed. In general, prefetching is most effective in conjunction with the continuous sequential memory access patterns associated with array processing.
  1. DMA. Some processing environments provide Direct Memory Access (DMA) controllers for block data movement in parallel with computation. For example, the Cell Broadband Engine uses DMA controllers to move the data to and from the local stores of the synergistic processing units. Arrays of contiguous data elements are well suited to bulk data movement using DMA.
  1. SIMD. Single Instruction Multiple Data (SIMD) capabilities of modern processor instruction sets allow simultaneous application of particular instructions to sets of elements from parallel arrays. For effective use of SIMD capabilities, an SoA (Structure of Arrays) model is preferable to an AoS (Array of Structures) model.
  1. Multicore processors. Array-oriented processing can enable the effective distribution of work to the individual cores of a multicore system in two distinct ways. First, provided that sequential dependencies can be minimized or eliminated, large arrays can be divided into separate segments to be processed in parallel on each core. Second, pipeline parallelism can be used to implement efficient multipass processing with each pass consisting of a processing kernel with array-based input and array-based output.
  1. Streaming Buffers for Large XML Documents. In the event that an XML document is larger than can be reasonably represented entirely within processor memory, a buffer-based streaming model can be applied to work through a document using sliding windows over arrays of elements stored in document order.
  1. JNI. The Java Native Interface (JNI) allows communication between a Java runtime environment and native processing resources on a host machine, but can impose substantial overhead with each call. In addition, data type conversion may be needed for all but the simplest data types. Bulk transfer of arrays of simple types (e.g., integers) can minimize both the number of JNI invocations and the cost of data conversion.
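The segment-based decomposition described under multicore processing, combined with the SoA layout described under SIMD, can be sketched as follows. This is a minimal illustration under stated assumptions, not the Parabix implementation: the parallel arrays starts and ends and the helpers split_segments, span_lengths, and parallel_span_lengths are hypothetical names. The kernel has no dependencies between elements, so each segment can be processed independently. Note that threads in CPython do not give true parallel speedup for pure Python code, so the example shows only the shape of the decomposition.

```python
from array import array
from concurrent.futures import ThreadPoolExecutor


def split_segments(n: int, parts: int) -> list:
    """Split range(n) into at most `parts` contiguous (lo, hi) segments."""
    step = max(1, -(-n // parts))  # ceiling division, at least 1
    return [(lo, min(lo + step, n)) for lo in range(0, n, step)]


def span_lengths(starts: array, ends: array, lo: int, hi: int) -> array:
    """Kernel: array-based input, array-based output, for one segment."""
    return array("I", (ends[i] - starts[i] for i in range(lo, hi)))


def parallel_span_lengths(starts: array, ends: array, workers: int = 4) -> array:
    """Apply the kernel to each segment, then concatenate in document order."""
    segments = split_segments(len(starts), workers)
    out = array("I")
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in segment order, preserving document order.
        for piece in pool.map(lambda seg: span_lengths(starts, ends, *seg), segments):
            out.extend(piece)
    return out


# Hypothetical SoA data: one array per field, not one record per item.
starts = array("I", [3, 17, 40])
ends = array("I", [13, 27, 55])
print(list(parallel_span_lengths(starts, ends)))
```

Because each field lives in its own contiguous array, a segment is just a pair of indices, and the same (lo, hi) ranges could equally describe the sliding windows of a buffer-based streaming pass.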