Timestamp: Oct 19, 2012, 6:56:35 PM
Author: nmedfort
Message: more edits

File: 1 edited

  • docs/Working/icXML/multithread.tex

    r2502 r2505  
- \section{Leveraging SIMD Parallelism for Multicore: Pipeline Parallelism}
+ %\section{Leveraging SIMD Parallelism for Multicore: Pipeline Parallelism}
+ \section{Leveraging Pipeline Parallelism}
  % As discussed in section \ref{background:xerces}, Xerces can be considered a complex finite-state machine
  % Finite-state machine belongs to the hardest application class to parallelize and process efficiently

  As discussed in section \ref{background:xerces}, Xerces can be considered a complex finite-state machine,
  the hardest type of application to parallelize and process efficiently \cite{Asanovic:EECS-2006-183}.
- However, \icXML{} provides logical layers between modules,
- which naturally enables pipeline parallel processing.
+ However, \icXML{} provides logical layers between modules, which naturally enables pipeline parallel processing.

- In our pipeline model, each thread is in charge of one module or one group of modules.
- A straight forward division is to take advantage of the layer between Parabix Subsystem and Markup Processor.
- In this case, the first thread $T_1$ will read 16k of XML input $I$ at a time
- and process all the modules in Parabix Subsystem to generates
- % content buffer, symbol array, URI array, context ID array and store them to a pre-allocated shared data structure $S$.
- content buffer, symbol array, URI array, and store them to a pre-allocated shared data structure $S$.
- The second thread $T_2$ consumes the data provided by the first thread and
- goes through all the modules in Markup Processor and writes output $O$.
+ In the pipeline model, each thread is in charge of a group of modules.
+ A straightforward division is to take advantage of the layer between \PS{} and \MP{}.
+ In this case, the first thread $T_1$ reads 16k of XML input $I$ at a time and produces the
+ content, symbol, and URI streams, then stores them in a pre-allocated shared data structure $S$.
+ The second thread $T_2$ consumes $S$, performs well-formedness and grammar-based validation,
+ then generates the output $O$.
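To make the division concrete, the following C++ sketch shows one way the two stages and the shared structure $S$ could be organized. It is only an illustration under assumed names: ParsedChunk, CHUNK_SIZE, runParabixSubsystem and runMarkupProcessor are hypothetical, the stage bodies are placeholders, and the real icXML modules are not shown. The driver runs the stages sequentially here; the threaded handoff is sketched further below.

// Illustrative sketch of the two-stage division (assumed names, not icXML code).
#include <cstddef>
#include <string>
#include <vector>

constexpr std::size_t CHUNK_SIZE = 16 * 1024;   // 16k of XML input per chunk

// One entry of the shared structure S: the streams produced by the
// Parabix Subsystem for a single chunk of input.
struct ParsedChunk {
    std::vector<char>        content;   // content stream
    std::vector<std::string> symbols;   // symbol stream
    std::vector<std::string> uris;      // URI stream
};

// Stage 1 (thread T1): run the Parabix Subsystem modules over one chunk.
ParsedChunk runParabixSubsystem(const std::string& chunk) {
    ParsedChunk entry;
    entry.content.assign(chunk.begin(), chunk.end());   // placeholder for the real parsing work
    return entry;
}

// Stage 2 (thread T2): run the Markup Processor (well-formedness and
// grammar-based validation) over one entry and append to the output O.
void runMarkupProcessor(const ParsedChunk& entry, std::string& output) {
    output.append(entry.content.begin(), entry.content.end());   // placeholder
}

int main() {
    std::string xml = "<doc>example</doc>";
    std::string output;
    // Sequential stand-in for the two threads: one 16k chunk at a time.
    for (std::size_t pos = 0; pos < xml.size(); pos += CHUNK_SIZE) {
        ParsedChunk entry = runParabixSubsystem(xml.substr(pos, CHUNK_SIZE));
        runMarkupProcessor(entry, output);
    }
    return 0;
}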
  The shared data structure is implemented using a ring buffer,
- where each entry consists of all the arrays shared between the two threads with size of 160k.
+ where every entry contains an independent set of data streams.
  In the examples of Figures \ref{threads_timeline1} and \ref{threads_timeline2}, the ring buffer has four entries.
  A lock-free mechanism is applied to ensure that each entry can be read or written by only one thread at a time.
- In Figure \ref{threads_timeline1}, the processing time of the first thread is longer,
- thus the second thread always wait for the first thread to finish processing one chunk of input
- and write to the shared memory.
- Figure \ref{threads_timeline2} illustrates a different situation where the second thread is slower
- and the first thread has to wait for the second thread finishing reading the shared data before it can reuse the memory space.
+ In Figure \ref{threads_timeline1}, the processing time of $T_1$ is longer than that of $T_2$;
+ thus $T_2$ always waits for $T_1$ to write to the shared memory.
+ Figure \ref{threads_timeline2} illustrates the scenario in which $T_1$ is faster
+ and must wait for $T_2$ to finish reading the shared data before it can reuse the memory space.
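A minimal C++ sketch of this ring-buffer handoff follows, assuming a simple per-entry flag as the lock-free mechanism. The Entry type, the filled flag, and the yield-based waits are illustrative assumptions, not icXML's actual implementation.

// Sketch of a four-entry ring buffer with lock-free handoff between T1 and T2.
#include <atomic>
#include <cstddef>
#include <functional>
#include <string>
#include <thread>
#include <vector>

constexpr std::size_t RING_ENTRIES = 4;

struct Entry {
    std::vector<char> streams;        // stand-in for the shared data streams
    std::atomic<bool> filled{false};  // false: owned by T1, true: owned by T2
};

Entry ring[RING_ENTRIES];
std::atomic<bool> done{false};

// T1: fill entries in order; if T2 is slower (Figure threads_timeline2),
// T1 waits here until the entry has been consumed and can be reused.
void producer(const std::vector<std::string>& chunks) {
    std::size_t i = 0;
    for (const std::string& chunk : chunks) {
        Entry& e = ring[i];
        while (e.filled.load(std::memory_order_acquire)) std::this_thread::yield();
        e.streams.assign(chunk.begin(), chunk.end());      // placeholder for Parabix Subsystem output
        e.filled.store(true, std::memory_order_release);   // hand the entry to T2
        i = (i + 1) % RING_ENTRIES;
    }
    done.store(true, std::memory_order_release);
}

// T2: consume entries in the same order; if T1 is slower (Figure
// threads_timeline1), T2 waits here until the next entry has been written.
void consumer(std::string& output) {
    std::size_t i = 0;
    for (;;) {
        Entry& e = ring[i];
        while (!e.filled.load(std::memory_order_acquire)) {
            if (done.load(std::memory_order_acquire) &&
                !e.filled.load(std::memory_order_acquire))
                return;                                     // no more input
            std::this_thread::yield();
        }
        output.append(e.streams.begin(), e.streams.end());  // placeholder for Markup Processor work
        e.filled.store(false, std::memory_order_release);    // return the entry to T1
        i = (i + 1) % RING_ENTRIES;
    }
}

int main() {
    std::vector<std::string> chunks = {"<a>", "text", "</a>"};
    std::string output;
    std::thread t2(consumer, std::ref(output));
    producer(chunks);                                        // T1 runs on the main thread
    t2.join();
    return 0;
}

Because each entry is owned by exactly one thread at a time, the faster thread can run ahead by at most a few chunks before it must wait, which is the waiting behaviour depicted in the two timeline figures.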

  To understand the performance improvement that can be achieved by this pipeline model,