# Changeset 1774

Ignore:
Timestamp:
Dec 13, 2011, 4:50:42 PM (7 years ago)
Message:

minor changes

Location:
docs/HPCA2012/final_ieee
Files:
15 edited

Unmodified
Removed
• ## docs/HPCA2012/final_ieee/00-abstract.tex

 r1752 Modern applications employ text files widely, providing data storage in a readable format for applications ranging from database systems to mobile phones. Traditional text processing tools are built around a byte-at-a-time sequential processing model that introduces significant branch and cache miss penalties.  Recent work has explored an alternative, transposed representation of text, Parabix (Parallel Bit Streams), to accelerate scanning and parsing using SIMD facilities. This paper advocates and develops Parabix as a general framework and toolkit, describing the software toolchain and run-time support that allows applications to exploit modern SIMD instructions for high-performance text processing. Parabix exploits intra-core SIMD hardware and demonstrates 2$\times$--7$\times$ speedup and 4$\times$ improvement in energy efficiency when compared with two widely used conventional software parsers, Expat and Apache-Xerces. SIMD implementations across three generations of x86 processors are studied, including the new \SB{}. The 256-bit AVX technology in Intel \SB{} is compared with the well-established 128-bit SSE technology to analyze the benefits and challenges of 3-operand instruction formats and wider SIMD hardware.  Finally, we show that thread-level parallelism enables the application to exploit SIMD units scattered across the different cores, achieving improved performance (2$\times$ on 4 cores) while maintaining single-threaded energy levels.
• ## docs/HPCA2012/final_ieee/01-intro.tex

 r1768 \section{Introduction} Modern applications ranging from web search to analytics are mainly data centric, operating over large swaths of information. Information expansion and diversification of data has resulted in multiple textual storage formats.  Of these, XML is one of the most widely used standards. In this paper, we generalize parallel bit streams and develop the Parabix programming framework to help programmers build text processing applications. Programmers specify operations on unbounded character lists using bit streams in a python environment. Our code generation and runtime system translates them into low-level C++ routines.  The Parabix routines exploit the SIMD extensions on commodity processors (SSE/AVX on x86, Neon on ARM) to process hundreds of character positions in an input stream simultaneously, dramatically improving the execution efficiency. We describe the overall Parabix tool chain, a novel execution framework and software build environment that enables text processing applications to effectively exploit commodity multicore architectures. Figure~\ref{perf-energy} showcases the overall efficiency of our framework and dramatic improvements in both performance and energy efficiency. The Parabix-XML parser exploits the bit stream technology to dramatically reduce branches in parsing routines and realize a more efficient pipeline execution. It also substantially improves register utilization, which minimizes energy wasted on cache misses and data transfers. 1) We outline the Parabix architecture, code-generation tool chain and runtime environment; and describe how it may be used to produce efficient XML parser implementations on a variety of commodity processors.  While studied in the context of XML parsing, the Parabix architecture, tool chain and runtime environment are applicable to text processing more generally. Section~\ref{section:parser} describes the design of an XML parser based on the Parabix framework. Section~\ref{section:methodology} details our evaluation framework. Section~\ref{section:baseline} presents a detailed performance analysis of Parabix on a \CITHREE\ system using hardware performance counters.
• ## docs/HPCA2012/final_ieee/02-background.tex

 r1737 parsing is that processing an XML document requires at least one conditional branch per byte of source text.  For example, Xerces-C, which forms the foundation for the widely deployed Apache XML project \cite{xerces}, uses a series of nested switch statements and state-dependent flag tests to control the parsing logic of the program. Xerces's complex data dependent control flow requires between 6--13 branches per byte of XML input, depending on the markup in the file (details in Section~\ref{section:XML-branches}).  Cache utilization is also significantly reduced due to the manner in which XML data is accessed. In general, while microarchitectural improvements may help the parser tide over some of these challenges (e.g., cache misses), the fundamental data and control flow in byte-at-a-time parsers is ill suited for commodity processors and experiences significant overhead.
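The per-byte branching cost described above can be illustrated with a minimal sketch (hypothetical model code, not taken from Xerces): a scalar scanner evaluates at least one data-dependent branch for every input character.

```python
def find_tag_opens(text: str) -> list[int]:
    """Naive byte-at-a-time scan: one data-dependent branch is
    taken per input character, which is exactly the per-byte
    branching cost this section describes."""
    positions = []
    for i, ch in enumerate(text):
        if ch == '<':  # branch per byte; hard to predict on mixed markup
            positions.append(i)
    return positions
```

On markup-dense inputs these branches follow the (irregular) structure of the document, which is why they predict poorly.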
• ## docs/HPCA2012/final_ieee/03-research.tex

 r1751 traditional text processing models is in how Parabix represents the source data.  Given a traditional byte-oriented text stream, Parabix first transposes the text data into a transform representation consisting of 8 parallel bit streams, known as {\em basis bit streams}, wherein basis bit stream $b_{k}$ represents the stream of $k$-th bits of each byte in the source text.  That is, the $k$-th bit of the $i$-th byte in the source text is in one-to-one correspondence with the $i$-th bit of the $k$-th basis bit stream, $b_{k}$.  For example, Figure~\ref{fig:BitStreamsExample} shows how the ASCII string ``{\ttfamily b7<A}'' is represented as 8 basis bit streams, $\tt b_{0 \ldots 7}$. The bits used to construct $\tt b_7$ are highlighted in this example. Parabix can then use the 128-bit SIMD registers commonly found on commodity processors (e.g. SSE on Intel) to process 128 byte positions at a time using bitwise logical, shift and arithmetic operations. Just as forward and inverse Fourier transforms are used to transform between the time and frequency domains in signal processing, bit stream transposition and inverse transposition provide ``byte space'' and ``bit space'' domains for text. The goal of the Parabix framework is to support efficient text processing using these two equivalent representations in the analogous way that efficient signal processing benefits from the use of the frequency domain in some cases and the time domain in others.
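The transposition step can be sketched in plain Python — an illustrative model that uses unbounded integers as bit streams, whereas the actual Parabix implementation performs the transposition blockwise with SIMD pack operations:

```python
def transpose(text: bytes) -> list[int]:
    """Transpose a byte stream into 8 basis bit streams.

    Model only: stream k is an int whose i-th bit equals bit k of
    the i-th input byte.  Bit numbering follows the paper, with
    b0 holding the high bit of each byte.
    """
    basis = [0] * 8
    for i, byte in enumerate(text):
        for k in range(8):
            if byte & (0x80 >> k):
                basis[k] |= 1 << i
    return basis
```

For the example string ``b7<A'' (bytes 0x62, 0x37, 0x3C, 0x41), stream $b_7$ collects the lowest bit of each byte.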
% For example, in a CSV file, any `,' or `\textbackslash n' can indicate the start of a new column or row respectively. For example, in XML, any opening angle bracket character, \verb|<|, may indicate the start of a markup tag. Traditional byte-at-a-time parsers find these characters by comparing the value of each byte with a set of known significant characters and branching appropriately when one is found, typically using an if or switch statement. % However, a \verb|<| is legal within an XML comment so not every \verb|<| necessarily means that we are opening a new tag. Character-class bit streams enable up to 128 ``comparisons'' in parallel through a series of boolean-logic operations\footnote{$\land$, $\lor$ and $\lnot$ denote the boolean AND, OR and NOT operations.} that merge multiple basis bit streams into a single character-class stream that marks the positions of key characters with a $1$. For example, a character is an \verb|<| if and only if $\lnot(b_0 \lor b_1) \land (b_2 \land b_3 \land b_4 \land b_5) \land \lnot (b_6 \lor b_7) = 1$.  Additional character classes can be determined with similar formulas.  For example, a character is a number {\tt [0-9]} if and only if $\lnot(b_0 \lor b_1) \land (b_2 \land b_3) \land \lnot(b_4 \land (b_5 \lor b_6))$.
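As a concrete sketch of the '<' formula above (model code, again using Python integers as bit streams; Parabix generates the equivalent SIMD logic):

```python
def basis(text: bytes) -> list[int]:
    # Transpose into 8 basis bit streams; b[0] holds the high bit of each byte.
    b = [0] * 8
    for i, byte in enumerate(text):
        for k in range(8):
            if byte & (0x80 >> k):
                b[k] |= 1 << i
    return b

def lt_stream(text: bytes) -> int:
    # '<' (0x3C = 00111100): ~(b0|b1) & b2 & b3 & b4 & b5 & ~(b6|b7),
    # masked to the text length since Python's ~ is unbounded.
    b = basis(text)
    ones = (1 << len(text)) - 1
    return (~(b[0] | b[1]) & b[2] & b[3] & b[4] & b[5]
            & ~(b[6] | b[7])) & ones
```

For the input `b"<a><b>"` the resulting stream has 1-bits at positions 0 and 3, marking both opening brackets with a single pass of bitwise logic.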
To perform lexical analysis on the input data, Parabix computes lexical and error bit streams from the character-class bit streams using a mixture of both boolean logic and arithmetic operations. Lexical bit streams typically mark multiple current parsing positions. Unlike the single-cursor approach of traditional text parsers, the marking of multiple lexical items allows Parabix to process multiple items in parallel. Error bit streams are often the byproduct or derivative of computing lexical bit streams and can be used to identify any well-formedness issues found during the parsing process. A $\tt 1$ bit in an error stream indicates the presence of a potential error that may require further processing to determine cause and severity. %How errors are handled depends on the logical implications of the error and goes beyond the scope of this paper. To form lexical bit streams we introduce two new operations: {\tt Advance} and {\tt ScanThru}. The {\tt Advance} operator accepts one input parameter, $c$, which is typically viewed as a bit stream containing multiple cursor bits, and advances each cursor one position forward.  On little-endian architectures, shifting forward means shifting to the right.
{\tt ScanThru} accepts two input parameters, $c$ and $m$, where $c$ denotes an initial set of cursor positions, and $m$ denotes a set of ``marked'' lexical item positions. The {\tt ScanThru} operation determines the cursor positions immediately following any run of marker positions by calculating $(c + m) \land \lnot m$. For example, in Figure \ref{fig:ParabixParsingExample} suppose we have the regular expression \verb|<[a-zA-Z]+>| and wish to find all instances of it in the source text. By computing $E_{0}$, the parser notes that ``\verb|<>|'' does not match the expected pattern. To find the end positions of each token, the parser calculates $L_{1}$ by moving the cursors in $L_0$ through the letter bits in $C_0$. $L_1$ is then validated to ensure that each token ends with a \verb|>|.
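A minimal model of the two operations, using Python integers as bit streams with bit $i$ denoting text position $i$ (so ``forward'' is a left shift in this model):

```python
def advance(c: int) -> int:
    # Move every cursor bit one text position forward.
    return c << 1

def scan_thru(c: int, m: int) -> int:
    # (c + m) & ~m: each cursor in c rides the addition carry through
    # its run of 1-bits in m and lands on the first following 0-bit.
    return (c + m) & ~m

# Example: text "<ab>", marker stream for letters [a-zA-Z]:
m = 0b0110           # positions 1 and 2 hold letters
c = advance(0b0001)  # cursor on '<' at position 0, advanced to position 1
end = scan_thru(c, m)  # cursor lands on position 3, the '>'
```

The carry propagation inside the addition is what lets a single arithmetic operation move all cursors through their marker runs simultaneously.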
• ## docs/HPCA2012/final_ieee/03b-research.tex

 r1733 \end{figure*} This section describes the implementation of the Parabix XML parser for well-formedness checking. Figure \ref{parabix_arch} shows its overall structure. The input file is processed using 11 functions organized into 7 modules. In the first module, {\tt Read\_Data}, the input file is loaded into the
• ## docs/HPCA2012/final_ieee/04-methodology.tex

 r1743 against two widely available open-source parsers: Xerces-C \cite{xerces} and Expat \cite{expat}. Each of the parsers is evaluated on the task of implementing the full parsing and well-formedness checking requirements of the XML 1.0 specification~\cite{TR:XML}. Xerces-C version 3.1.1 (SAX) is a validating XML parser. \paragraph{Platform Hardware:} SSE SIMD extensions have been available on commodity Intel processors for over a decade, since the Pentium III. They have steadily evolved with improvements in instruction latency, cache interface, register resources, and the addition of domain specific instructions. Here we investigate SIMD extensions across three different generations of Intel processors (hardware details given in Table \ref{hwinfo}). We compare the energy and performance profile of the Parabix parser on each of the platforms. We also analyze the implementation specifics of SIMD extensions under various microarchitectures as well as the newer AVX extensions supported by \SB{}. We investigate the execution profiles of each XML parser using the performance counters found in the processor. We choose several key hardware events that provide insight into the profile of each application and indicate if the processor is doing useful work~\cite{bellosa2001, bertran2010}. The set of events included in our study are: branch instructions, branch mispredictions, integer instructions, SIMD instructions, and cache misses.
In addition, we characterize the SIMD operations and study the type and class of SIMD operations using the Intel Pin binary instrumentation tool. Power consumption is monitored by an Agilent 34410a digital multimeter at the granularity of 100 samples per second. This measurement captures the instantaneous power to the processor package, including the cores, caches, northbridge memory controller, and the quick-path interconnects. We obtain samples throughout the entire execution of the program and then calculate overall energy consumption.
• ## docs/HPCA2012/final_ieee/05-corei3.tex

 r1768 \label{section:baseline} In this section we analyze the energy and performance characteristics of the Parabix XML parser against the software XML parsers Xerces and Expat. For our baseline evaluation, we compare all the XML parsers on the \CITHREE{}. Table \ref{cache_misses} shows cache misses per kilobyte of input data. Analytically, the cache misses for the Expat and Xerces parsers represent a 0.5 cycle per XML byte cost. This overhead has little impact on the overall performance of these parsers as they experience additional overheads related to branch mispredictions. When compared with Xerces and Expat, the data organization of Parabix-XML significantly reduces the overall cache miss rate; specifically, there were $7\times$ and $15\times$ fewer L1 and L2 cache misses compared to the conventional parsers. The performance of traditional parsers is limited by their branch behavior.  Xerces experiences up to 13 branches per input XML character on the high markup files whereas Expat experiences up to 8. In Parabix-XML, the use of SIMD operations eliminates a significant proportion of the overall branches.
Most conditional branches can be replaced with bitwise operations, which can process up to 128 characters worth of branches with one operation, or with a series of logical predicate operations. The high branch misprediction rate of conventional parsers is a significant overhead, with the cost of a single branch misprediction on the order of 10s of CPU cycles spent to restart the processor pipeline. Parabix-XML is nearly branch free and exhibits minimal dependence on the source markup density. Specifically, our study demonstrates that Parabix experiences between 19.5 and 30.7 branch mispredictions per kB of XML data. Conversely, the cost of branch mispredictions for the Expat parser can be over 7 cycles per XML byte (see Figure~\ref{corei3_BM}), which exceeds the average latency of a byte processed by Parabix-XML. The branch misprediction rate of traditional XML parsers is difficult to reduce due to a number of factors: (1) the variable length nature of the syntactic elements contained within XML documents; (2) input data dependent characteristics; and (3) the extensive set of syntax constraints imposed on data-oriented input. } \end{center} \caption{Branch Mispredictions on the \CITHREE{} per kB input} \label{corei3_BM}
Traditional parsers can be dramatically slowed by dense markup but Parabix-XML is relatively unaffected. Unlike Parabix-XML and Expat, Xerces transcodes input to UTF-16 prior to processing; this requires several cycles per byte. However, transcoding using parallel bit streams is significantly faster and requires less than a single cycle per byte. The energy trends shown in Figure \ref{corei3_energy} reveal an interesting result. Parabix consumes substantially less energy than the other parsers: 50 to 75 nJ per byte, while Expat and Xerces consume 80nJ to 320nJ and 140nJ to 370nJ per byte respectively. Parabix-XML experiences a minimal increase in power consumption ($\sim5\%$) as compared to the conventional parsers. While the SIMD functional units are significantly wider than their scalar counterparts, register width and functional unit power account only for a small fraction of the overall power consumption in a processor pipeline. Parabix amortizes the fetch and data access overheads over multiple data parallel operations. Although Parabix requires slightly more power (per instruction), the processing time of Parabix
• ## docs/HPCA2012/final_ieee/06-scalability.tex

 r1743 In this section, we study the performance of the XML parsers across three generations of Intel architectures.  Figure \ref{Parabix_all_platform} shows the average execution time of Parabix-XML over all workloads.  We analyze the execution time in terms of SIMD operations that operate on bit streams in \textit{bit-space} and scalar operations used to perform ``post processing'' operations on the source input.  In Parabix-XML, a significant fraction of the overall execution time is spent on SIMD operations. Our results demonstrate that Parabix-XML's optimizations complement microarchitectural improvements: \CITHREE{} has a 40\% performance increase over \CO{}; similarly, \SB{} has a 20\% improvement compared to \CITHREE{}. These gains appear independent of the markup density of the input file. Postprocessing operations demonstrate data dependent variance: \CITHREE\ improves performance only by 29\% over \CO\ while \SB\ improves performance by less than 6\% over \CITHREE{}. Note that the gains of \CITHREE\ over \CO\ include improvements in both clock frequency and microarchitecture while \SB{}'s gains are mainly attributed to the architecture. Figure \ref{Parabix_all_platform} also shows the average power consumption of Parabix-XML over each workload as executed on each of the processors: \CO{}, \CITHREE\ and \SB{}.
Each generation of processor appears to bring a 25--30\% improvement in power consumption over the previous generation. Parabix-XML on \SB\ consumes 72\%--75\% less energy than it did on \CO{}. \def\CORTEXA8{Cortex-A8} \subsection{Parabix on Mobile Processors} \label{section:scalability:\NEON{}} Our experience with Intel processors led us to question whether mobile processors with SIMD support, such as the ARM \CORTEXA8{}, could benefit from Parabix technology. ARM \NEON{} provides a 128-bit SIMD instruction set similar in functionality to the Intel SSE3 instruction set. In this section, we present our performance comparison of a \NEON{}-based port of Parabix versus the Expat parser; Xerces is excluded from this comparison. The platform we use is the Samsung Galaxy Android Tablet that houses a Samsung S5PC110 ARM \CORTEXA8{} 1Ghz single-core, dual-issue, superscalar microprocessor. This device includes a 32kB L1 data cache and a 512kB L2 shared cache.  Migration of Parabix-XML to the Android platform only required developing a Parabix runtime library for ARM \NEON{}. However, a small subset of key SIMD instructions (e.g., bit packing) did not exist on \NEON{}. In such cases, the logical equivalents of those instructions were emulated using the available ISA. The resulting application was cross-compiled for Android using the Android NDK. We measured the execution time of Expat and Parabix for the various input workloads on the \CORTEXA8{}; Figure~\ref{relative_performance_intel} plots the performance for the \CITHREE{}. The results demonstrate that the execution time of each parser varies in a linear fashion with respect to the markup density of the file.
On the \CORTEXA8{}, the \NEON{} unit is implemented as a coprocessor, which imposes a higher overhead for applications that frequently inter-operate between scalar and SIMD registers. Future performance enhancements to the \NEON{} ISA on ARM could substantially improve the efficiency of Parabix.
• ## docs/HPCA2012/final_ieee/07-avx.tex

 r1751 \subsection{3-Operand Form} In addition to widening the 128-bit operations to 256-bit operations, AVX technology uses a nondestructive 3-operand instruction format. Previous SSE implementations used a destructive 2-operand format, with one register serving as both a source and destination register. As such, 2-operand instructions that require the value of both $a$ and $b$ must either copy an additional register value beforehand, or reconstitute a register value afterwards to recover the value.  With the 3-operand format, output may now be directed to the third register independently of the source operands; since there is no longer a need to copy or reconstitute operand values, a considerable reduction in the instructions required for unloading from and loading into registers is achieved.  AVX technology makes the 3-operand form available for both the new 256-bit AVX operations as well as the 128-bit SSE operations. \subsection{256-bit Operations} Our approach is to leverage the 256-bit AVX instructions wherever possible and to simulate the remaining operations using pairs of 128-bit operations. Figure \ref{insmix} shows the reduction in instruction count achieved in each version. For each workload, the base instruction count of the Parabix binary compiled in 2-operand SSE-only mode is indicated by ``sse''; the version that only takes advantage of the AVX 3-operand mode is labeled ``128-bit avx'', and the version that uses the 256-bit operations wherever possible is labeled ``256-bit avx''.
The instruction counts are divided into three classes: ``non-SIMD'' operations are the general purpose instructions; the ``bitwise SIMD'' class remains relatively constant with each workload; the third is the ``other SIMD'' class. As expected, the number of bitwise SIMD operations remains the same for both SSE and 128-bit AVX while dropping dramatically when operating 256 bits at a time. The reduction was measured at 32\%--39\% depending on the markup density of the workload. The benefits of the 3-operand form seem to fully translate into performance benefits.  Based on the reduction of overall bitwise-SIMD instructions we expected an 11\% improvement in performance. Surprisingly, the performance of Parabix in the 256-bit AVX implementation does not improve significantly and actually degrades for files with higher markup density ($\sim11\%$). dew.xml, on which bitwise-SIMD instructions were reduced by 39\%, saw a performance improvement of 8\%.  We believe that this is primarily due to the intricacies of the first generation AVX implementation in \SB{}, with
• ## docs/HPCA2012/final_ieee/09-pipeline.tex

 r1743 Even if an application is infinitely parallelizable and thread synchronization costs are non-existent, all applications are constrained by the power and energy overheads incurred when utilizing multiple cores; as more cores are put to work, a proportional increase in power occurs. The typical approach to handling data parallelism with multiple threads involves partitioning data uniformly across the threads. However, XML parsing is inherently sequential, which makes it difficult to partition the data. Several attempts have been made to address this problem. For example, using a preparsing phase to help determine the tree structure and to partition the XML document accordingly~\cite{dataparallel}. Another approach involved speculatively partitioning the data~\cite{Shah:2009}. We partitioned Parabix-XML into four stages and assigned a core to each stage. One of the key challenges was to determine which passes should be grouped together. We analyzed the latency and data dependencies of each of the passes in the single-threaded version of Parabix (Column 3 in Table~\ref{pass_structure}), and assigned the passes to stages to maximize throughput. Stages communicate through ring buffers; rate matching is achieved by controlling the overall size of the ring buffer. Whenever a faster stage runs ahead, it will effectively cause the ring buffer to fill up and force that stage to stall.
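The stage-and-ring-buffer arrangement can be sketched with bounded queues standing in for the ring buffers (hypothetical stage functions here; the real pipeline runs the Parabix passes):

```python
import queue
import threading

BUF_ENTRIES = 6  # buffer depth; six entries performed best in the experiments

def stage(inq, outq, work):
    """One pipeline stage: pull a block, process it, push it on.
    A bounded queue blocks a fast producer when its consumer lags,
    modeling the stall-on-full behavior of the ring buffers."""
    while True:
        block = inq.get()
        if block is None:        # end-of-input marker: forward and stop
            if outq is not None:
                outq.put(None)
            return
        result = work(block)
        if outq is not None:
            outq.put(result)

q1 = queue.Queue(maxsize=BUF_ENTRIES)
q2 = queue.Queue(maxsize=BUF_ENTRIES)
results = []
threads = [
    threading.Thread(target=stage, args=(q1, q2, str.upper)),
    threading.Thread(target=stage, args=(q2, None, results.append)),
]
for t in threads:
    t.start()
for block in ["lex", "parse", "check"]:
    q1.put(block)
q1.put(None)
for t in threads:
    t.join()
# results now holds the processed blocks, in order
```

Because `Queue.put` blocks once `maxsize` entries are pending, a stage that runs ahead of its successor stalls automatically, exactly the rate-matching behavior described above.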
Experiments show that six entries of the circular buffer gives the best performance. The 4-threaded version is $\simeq2\times$ faster than the single-threaded version and achieves $\simeq2.7$ cycles per input byte by exploiting the SIMD units of all \SB{}'s cores.  This performance approaches the 1 cycle per byte performance of custom hardware solutions~\cite{DaiNiZhu2010}. Parabix
• ## docs/HPCA2012/final_ieee/10-related.tex

 r1768 % construction costs of the more flexible DOM (Document Object Model) % parsers \cite{Perkins05}.  Nicola and John specifically identified XML parsing as a threat to database performance \cite{NicolaJohn03} and outlined a number of potential directions for improving performance. The commercial importance of XML parsing has spurred the development of numerous multi-threaded and hardware-based approaches: multithreaded XML techniques include \cite{DaiNiZhu2010}. Others have explored the design of custom hardware for bit parallel operations for text search in network processors~\cite{tan-sherwood-isca-2005}. Intel's SSE4.2 instructions targeted XML parsers, but these have not seen widespread use because of portability concerns and the programming challenges that accompany low level instructions. Earlier work analyzed SSE2 instructions and proposed an inductive doubling instruction set~\cite{CameronLin2009}. In this paper, we have developed a generalized Parabix architecture and have described the software tool chain that programmers can use to build scalable text processing applications on commodity multicores. We have explored in detail the tradeoffs

• ## docs/HPCA2012/final_ieee/final.log

 r1768 (Regenerated pdfTeX build log, version 3.1415926-1.40.10, TeX Live 2009/Debian: rebuild timestamp changed from 8 DEC 2011 12:16 to 13 DEC 2011 16:49; figure placements and overfull/underfull box warnings shifted with the text edits; output grew from 12 pages, 517924 bytes to 13 pages, 518284 bytes.)