Changeset 1393 for docs

Ignore:
Timestamp:
Aug 30, 2011, 10:47:59 AM (8 years ago)
Message:

Minor bug fixes up to 04

Location:
docs/HPCA2012
Files:
9 edited

• docs/HPCA2012/00-abstract.tex

 r1360 In modern applications, text files are employed widely. For example, XML files provide data storage in human-readable format and are ubiquitous in applications ranging from database systems to mobile phone SDKs. Traditional text processing tools are built around a byte-at-a-time processing model in which each character token of a document is examined in sequence. The byte-at-a-time model gives rise to highly challenging input-dependent branches, which cause pipeline squashes and stalls. Furthermore, typical text processing tools perform few operations per processed character and experience high cache miss rates. Overall, parsing text in important domains like XML processing requires high performance, motivating the adoption of custom hardware solutions. In this paper, we enable text processing applications to effectively use commodity processors. We introduce Parabix (Parallel Bitstream) technology, a software toolchain and execution framework that allows applications to exploit modern SIMD instruction extensions for high-performance text processing. Parabix enables the application developer to write constructs assuming unlimited SIMD data parallelism, and Parabix's bitstream translator generates code based on machine specifics (e.g., SIMD register widths). The key insight into efficient text processing in Parabix is the data organization: Parabix transposes the sequence of character bytes into sets of 8 parallel bit streams, which then enables us to operate on multiple characters with bit-parallel SIMD operations. We demonstrate the features and efficiency of Parabix with an XML parsing application.
We evaluate the Parabix-based XML parser against two widely used XML parsers, Expat and Apache's Xerces. Parabix makes efficient use of intra-core SIMD hardware and demonstrates a 2$\times$--7$\times$ speedup and a 4$\times$ improvement in energy efficiency compared to the conventional parsers. We assess the scalability of SIMD implementations across three generations of x86 processors, including the new Intel \SB{}. We compare the 256-bit AVX technology in \SB{} against the now legacy 128-bit SSE technology and analyze the benefits and challenges of using the AVX extensions. We also examine Parabix on mobile platforms using ARM processors with Neon SIMD extensions. Finally, we partition the XML parser into pipeline stages and demonstrate that thread-level parallelism enables the application to exploit SIMD units scattered across the different cores, improving performance (2$\times$ on 4 cores) at the same energy level as the single-threaded version.
• docs/HPCA2012/01-intro.tex

 r1373 \section{Introduction} We have now long since reached the limit of classical Dennard voltage scaling that enabled us to keep all of the transistors afforded by Moore's law active. This has already resulted in a rethink of the way general-purpose processors are built: processor frequencies have remained stagnant over the last 5 years, with Intel multicores able to boost a core's frequency only if other cores on the chip are shut off. Chip makers strive to achieve energy-efficient computing by operating at more optimal core frequencies and aim to increase performance with a larger number of cores. Unfortunately, given the limited levels of parallelism that can be found in applications~\cite{blake-isca-2010}, it is not certain how many cores can be productively used in scaling our chips~\cite{esmaeilzadeh-isca-2011}. This is because exploiting parallelism across multiple cores tends to require heavyweight threads that are difficult to manage and synchronize.
In this paper, we tackle the infamous ``thirteenth dwarf'' (parsers/finite state machines), considered to be the hardest application class to parallelize~\cite{Asanovic:EECS-2006-183}. We present Parabix, a novel execution framework and software run-time environment that can be used to dramatically improve the efficiency of text processing and parsing on commodity processors. Parabix transposes byte-oriented character data into parallel bit streams for the individual bits of each character byte and then exploits the SIMD extensions on commodity processors (SSE/AVX on x86, Neon on ARM) to process hundreds of character positions in an input stream simultaneously. XML is in widespread use, in applications ranging from Office Open XML in Microsoft Office to NDFD XML of the NOAA National Weather Service, from KML in Google Earth to Castor XML in the Martian Rovers, as well as ubiquitous XML data in Android phones. XML parsing efficiency is important for multiple application areas; in server workloads the key focus is on overall transactions per second, while in applications in network switches and cell phones, latency and energy are of paramount importance.
Traditional software-based XML parsers have many inefficiencies, including considerable branch misprediction penalties due to complex input-dependent branching structures, as well as poor use of memory bandwidth and data caches due to byte-at-a-time processing and multiple buffering. XML ASIC chips have been around for over 6 years, but typically lag behind CPUs in technology due to cost constraints. Our focus is on how much we can improve the performance of XML parsing on commodity processors with Parabix technology. This paper makes the following contributions: 1) We present Parabix, a novel software architecture, tool chain and run-time environment, and demonstrate its use in high-performance XML parsing.
2) We compare Parabix XML parsers against conventional parsers and assess the improvement in overall performance and energy efficiency on each platform. We are the first to compare and contrast the SSE/AVX extensions across multiple generations of Intel processors and show that there are performance challenges when using newer-generation SIMD extensions, possibly due to their memory interface. We compare the ARM Neon extensions against the x86 SIMD extensions and comment on the latency of SIMD operations across these architectures. 3) Finally, building on the SIMD parallelism of Parabix technology, we multithread the XML parser using pipeline parallelism to exploit the SIMD units scattered across multiple cores.
The remainder of this paper is organized as follows. Section~\ref{section:background} presents background material on XML parsing and provides insight into the inefficiency of traditional parsers on mainstream processors. Section~\ref{section:parabix} describes the Parabix architecture, tool chain and run-time environment. Section~\ref{section:parser} describes the application of the Parabix framework to the construction of an XML parser enforcing all the well-formedness rules of the XML specification. Section~\ref{section:methodology} describes the overall methodology of our performance and energy study. Section~\ref{section:baseline} presents a detailed performance analysis of Parabix on a \CITHREE\ system using hardware performance counters and compares it against conventional parsers. Section~\ref{section:scalability} compares the performance and energy efficiency of 128-bit SIMD extensions across three generations of Intel processors and includes a comparison with the ARM Cortex-A8 processor. Section~\ref{section:avx} examines Intel's new 256-bit AVX technology and comments on its benefits and challenges compared to the 128-bit SSE instructions. Section~\ref{section:multithread} looks at the multithreading of the Parabix XML parser, which seeks to exploit the SIMD units scattered across multiple cores. Section~\ref{section:related} discusses related work, after which Section~\ref{section:conclusion} concludes the paper.
• docs/HPCA2012/02-background.tex

 r1374 XML can represent virtually any type of information (i.e., content) in a descriptive fashion. XML markup encodes a description of an XML document's storage layout and logical structure. Since XML is intended to be human-readable, markup tags are often verbose by design \cite{TR:XML}. For example, Figure \ref{fig:sample_xml} provides a standard product list encapsulated within an XML document; all content is highlighted in bold, and anything that is not content is considered markup. \subsection{XML Parsers} Traditional XML parsers process an XML document sequentially, a single byte at a time, from the first to the last character in the source text. Each character is examined to distinguish between the XML-specific markup, such as a left angle bracket \verb`<`, and the content held within the document. The character that the parser is currently interpreting is commonly referred to as its {\it cursor}. As the parser moves its cursor through the document, it alternates between markup scanning, validation, and content processing operations. In other words, traditional XML parsers operate as finite-state machines that use byte comparisons to transition between data and metadata states. Each state transition indicates the context in which to interpret the subsequent characters. Unfortunately, textual data tends to consist of variable-length items sequenced in generally unpredictable patterns; thus any character could be a state transition until deemed otherwise.
A major disadvantage of the sequential byte-at-a-time approach to XML parsing is that processing an XML document requires at least one conditional branch per byte of source text. For example, Xerces-C, the most popular open-source C++ XML parser and the foundation of the widely deployed Apache XML project \cite{xerces}, uses a series of nested switch statements and state-dependent flag tests to control the parsing logic of the program. Our analysis, which we detail in Section \ref{section:XML-branches}, found that Xerces requires between 6 and 13 branches per byte of XML to support this form of control flow, depending on the fraction of markup in the overall document. Cache utilization is also significantly reduced due to the manner in which markup and content must be scanned and buffered for future use. For instance, Xerces incurs $\sim$100 L1 cache misses per 1000 bytes of XML. In general, while microarchitectural improvements may help the parser tide over some of these challenges (e.g., cache misses), the fundamental data and control flow in such parsers is ill suited to commodity processors and incurs significant overhead.
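To make the branch-per-byte cost concrete, the following is a minimal illustrative sketch (a hypothetical two-state scanner, far simpler than Xerces's actual logic): even a scanner that merely separates markup from content executes at least one input-dependent branch for every byte it consumes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Minimal illustrative two-state scanner (hypothetical; real parsers such
// as Xerces have many more states). It only separates markup from content,
// yet still needs at least one input-dependent branch per byte.
enum class State { Content, Markup };

// Returns the number of bytes classified as markup.
std::size_t count_markup_bytes(const std::string& text) {
    State state = State::Content;
    std::size_t markup = 0;
    for (char ch : text) {            // one iteration per byte
        switch (state) {              // input-dependent branch #1
        case State::Content:
            if (ch == '<') {          // input-dependent branch #2
                state = State::Markup;
                ++markup;
            }
            break;
        case State::Markup:
            ++markup;
            if (ch == '>') state = State::Content;  // branch #3
            break;
        }
    }
    return markup;
}
```

Real parsers multiply this cost with additional states for quoting, entity references, and well-formedness checks, which is how per-byte branch counts in the range observed for Xerces arise.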
• docs/HPCA2012/03-research.tex

 r1384 \label{section:parabix} This section presents an overview of the SIMD-based parallel bit stream text processing framework, \emph{Parabix}. The framework has three components: a unifying architectural view of text processing in terms of parallel bit streams, a tool chain for automating the generation of parallel bit stream code from higher-level specifications, and a run-time environment providing a portable SIMD programming abstraction, independent of the specific facilities available on particular target architectures. \subsection{Parallel Bit Streams} The fundamental difference between the Parabix framework and traditional text processing models is in how Parabix represents the source data. Given a traditional byte-oriented text stream, Parabix first transposes the text data to a transform domain consisting of 8 parallel bit streams, known as {\em basis bit streams}. In essence, each basis bit stream $b_{k}$ represents the stream of $k$-th bits of each byte in the source text. That is, the $k$-th bit of the $i$-th byte in the source text is in the $i$-th (bit) position of the $k$-th basis bit stream, $b_{k}$. For example, Figure \ref{fig:BitstreamsExample} shows how the ASCII string ``{\ttfamily b7\verb`<`A}'' is represented as 8 basis bit streams, $\tt b_{0 \ldots 7}$. The bits used to construct $\tt b_7$ have been highlighted in this example.
The advantage of the parallel bit stream representation is that we can use the 128-bit SIMD registers commonly found on commodity processors (e.g., SSE on Intel) to process 128 byte positions at a time using bitwise logic, shifting and other operations. Just as forward and inverse Fourier transforms are used to transform between the time and frequency domains in signal processing, bit stream transposition and inverse transposition provide ``byte space'' and ``bit space'' views of text. The goal of the Parabix framework is to support efficient text processing using these two equivalent representations, in the same way that efficient signal processing benefits from the use of the frequency domain in some cases and the time domain in others.
In the Parabix framework, basis bit streams are used as the starting point to determine other bit streams. In particular, Parabix uses the basis bit streams to construct \emph{character-class bit streams}, in which each $\tt 1$ bit indicates the presence of a significant character (or class of characters) in the parsing process. Character-class bit streams may then be used to compute \emph{lexical bit streams} and \emph{error bit streams}, which Parabix uses to process and validate the source document. The remainder of this section discusses each type of bit stream. {\bf Basis Bit Streams:} The input text is transposed into the basis bit streams so that Parabix can efficiently produce the character-class bit streams. Using the SIMD capabilities of current commodity processors, the transposition process incurs an amortized cost of approximately 1 cycle per byte \cite{CameronHerdyLin2008}.
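As a concrete scalar model of this transposition (an illustrative sketch, not the SIMD implementation, which uses packing operations to reach the amortized 1 cycle/byte cost), the following builds the eight basis bit streams for up to 64 bytes of text, with bit $i$ of stream $b_k$ holding bit $k$ of byte $i$, numbered from the most significant bit as in the paper:

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <string>

// Scalar model of Parabix transposition for up to 64 bytes of input.
// basis[k] bit i == bit k of input byte i, where b0 is the byte's most
// significant bit (matching the paper's b_0..b_7 numbering).
std::array<std::uint64_t, 8> transpose(const std::string& text) {
    std::array<std::uint64_t, 8> basis{};  // all streams start at zero
    for (std::size_t i = 0; i < text.size() && i < 64; ++i) {
        unsigned char byte = static_cast<unsigned char>(text[i]);
        for (int k = 0; k < 8; ++k) {
            std::uint64_t bit = (byte >> (7 - k)) & 1u;  // bit k, MSB first
            basis[k] |= bit << i;  // place it at position i of stream b_k
        }
    }
    return basis;
}
```

For the byte \verb`<` (0x3C, binary 00111100), streams $b_2$ through $b_5$ receive a $1$ at that position while $b_0$, $b_1$, $b_6$ and $b_7$ receive $0$, which is what the character-class formulas later in this section rely on.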
{\bf Character-class Bit Streams:} Typically, as text parsers process input data, they locate specific characters to determine if and when to transition between data and metadata parsing. For example, in XML, any opening angle bracket character, \verb`<`, may indicate the start of a new markup tag. Traditional byte-at-a-time parsers find these characters by comparing the value of each code point with a set of known code points and branching appropriately when one is found, typically using an if or switch statement. Using this method to perform multiple transitions in parallel is non-trivial and may require fairly sophisticated algorithms to do so correctly.
Character-class bit streams allow us to perform up to 128 ``comparisons'' in parallel with a single operation, using a series of boolean-logic operations\footnote{$\land$, $\lor$ and $\lnot$ denote the boolean AND, OR and NOT operations.} to merge multiple basis bit streams into a single character-class stream that marks the positions of key characters with a $1$. For example, a character is a \verb`<` if and only if $\lnot(b_0 \lor b_1) \land (b_2 \land b_3 \land b_4 \land b_5) \land \lnot (b_6 \lor b_7) = 1$. Classes of characters can be found with similar formulas. For example, a character is a number {\tt [0-9]} if and only if $\lnot(b_0 \lor b_1) \land (b_2 \land b_3) \land \lnot(b_4 \land (b_5 \lor b_6))$. An important observation is that a range of characters can sometimes take fewer operations and fewer basis bit streams to compute than individual characters. Finding optimal formulas for all character classes is non-trivial and beyond the scope of this paper. {\bf Lexical and Error Bit Streams:} To perform lexical analysis on the input data, Parabix computes lexical and error bit streams from the character-class bit streams, using a mixture of boolean logic and integer math.
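Under a scalar 64-byte model of the basis streams (an illustrative sketch, not the Parabix run-time), the \verb`<` formula above translates directly into whole-stream bitwise operations, classifying up to 64 positions at once:

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <string>

// Scalar 64-byte model: basis[k] bit i == bit k (MSB first) of input byte i.
std::array<std::uint64_t, 8> transpose(const std::string& text) {
    std::array<std::uint64_t, 8> basis{};
    for (std::size_t i = 0; i < text.size() && i < 64; ++i) {
        unsigned char byte = static_cast<unsigned char>(text[i]);
        for (int k = 0; k < 8; ++k)
            basis[k] |= static_cast<std::uint64_t>((byte >> (7 - k)) & 1u) << i;
    }
    return basis;
}

// Character-class stream for '<' (0x3C), applying the paper's formula
// ~(b0|b1) & (b2&b3&b4&b5) & ~(b6|b7) to every position simultaneously.
std::uint64_t mark_open_angle(const std::array<std::uint64_t, 8>& b) {
    return ~(b[0] | b[1]) & (b[2] & b[3] & b[4] & b[5]) & ~(b[6] | b[7]);
}
```

A single pass of six bitwise operations replaces up to 64 byte comparisons and their associated branches.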
Lexical bit streams typically mark multiple current parsing positions. Unlike the single-cursor approach of traditional text parsers, these allow Parabix to process multiple cursors in parallel. Error bit streams are often a byproduct or derivative of computing lexical bit streams and can be used to identify any well-formedness issues found during the parsing process. The presence of a $\tt 1$ bit in an error stream tends to mean that the lexical stream cannot be trusted to be completely accurate, and Parabix may need to perform some sequential parsing on that section to determine the cause and severity of the error. To form lexical bit streams, we have to introduce two new operations: \texttt{Advance} and \texttt{ScanThru}. The \texttt{Advance} operator accepts one input parameter, $c$, which is typically viewed as a bit stream containing multiple cursor bits, and advances each cursor one position forward. On little-endian architectures, shifting forward means shifting to the right. \texttt{ScanThru} accepts two input parameters, $c$ and $m$; any bit that is in both $c$ and $m$ is moved to the first subsequent $\tt 0$-bit in $m$ by calculating $(c + m) \land \lnot m$. For example, in Figure \ref{fig:ParabixParsingExample}, suppose we have the regular expression \verb`<[a-zA-Z]+>` and wish to find all instances of it in the source text.
We begin by constructing the character classes $C_{0}$, which consists of all letters, $C_{1}$, which contains all \verb`>`'s, and $C_{2}$, which marks all \verb`<`'s. In $L_0$ the position of every \verb`<` is advanced by one to locate the first character of each token. By computing $E_{0}$, the parser notes that ``\verb`<>`'' does not match the expected pattern. To find the end positions of each token, the parser calculates $L_{1}$ by moving the cursors in $L_0$ through the letter bits in $C_0$. $L_1$ is then validated to ensure that each token ends with a \verb`>`. The Parabix architecture also includes run-time libraries that support a machine-independent view of basic SIMD operations, as well as a set of core function libraries. For machine independence, we program all operations using an abstract SIMD machine. The abstract machine supports all power-of-2 field widths up to the full SIMD register width on a target machine. Let $w = 2^k$ be the field width in bits, let $f$ be a basic binary operation defined on $w$-bit quantities producing a $w$-bit result, and let $W = 2^K$ be the SIMD vector size in bits. Then the C++ template notation \verb`v = simd<w>::f(a,b)` denotes the general pattern for a vertical SIMD operation yielding an output SIMD vector $v$, given two input SIMD vectors $a$ and $b$.
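Both operators have one-line implementations in a scalar 64-bit model (a sketch; here bit $i$ of a word is text position $i$, so the forward shift is a left shift rather than the right shift used on little-endian SIMD registers):

```cpp
#include <cassert>
#include <cstdint>

// Scalar model: bit i of a stream corresponds to text position i,
// so moving every cursor forward one position is a left shift here.
std::uint64_t Advance(std::uint64_t c) { return c << 1; }

// Move each cursor in c that lies on a run of 1-bits in m through that
// run, landing on the first subsequent 0-bit of m.
std::uint64_t ScanThru(std::uint64_t c, std::uint64_t m) {
    return (c + m) & ~m;
}
```

For the text ``\verb`<ab>`'' with a cursor on the \verb`<` (position 0) and the letter class marking positions 1 and 2, \texttt{Advance} moves the cursor onto the \verb`a`, and \texttt{ScanThru} carries it through both letters to the \verb`>` at position 3, where the end-of-token check can be applied.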
For each field $v_i$ of $v$, the value computed is $f(a_i, b_i)$. For example, given 128-bit SIMD vectors, \verb`simd<8>::add(a,b)` represents the simultaneous addition of sixteen 8-bit fields. These operations were originally developed for 128-bit Altivec operations on Power PC as well as 64-bit MMX and 128-bit SSE operations on Intel, but have recently been extended to support the new 256-bit AVX operations on Intel as well as the 128-bit NEON operations on the ARM architecture.
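The template notation can be modelled portably in plain C++ (a scalar stand-in for the abstract SIMD machine, not the actual Parabix run-time library):

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// A 128-bit SIMD vector modelled as 16 bytes (a scalar stand-in for an
// SSE/NEON register in this sketch).
using v128 = std::array<std::uint8_t, 16>;

template <int w> struct simd;

// simd<8>::add(a,b): sixteen independent 8-bit additions; each field
// wraps around modulo 256, with no carry between fields.
template <> struct simd<8> {
    static v128 add(const v128& a, const v128& b) {
        v128 v{};
        for (int i = 0; i < 16; ++i)
            v[i] = static_cast<std::uint8_t>(a[i] + b[i]);  // wraps mod 256
        return v;
    }
};
```

On a real target, each specialization maps to a single instruction (e.g., a 16-lane packed add), which is what makes the abstract-machine view portable without sacrificing performance.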
• docs/HPCA2012/03b-research.tex

 r1388 \end{figure} This section describes the implementation of the Parabix XML parser using the framework explained in Section \ref{section:parabix}. Figure \ref{parabix_arch} shows its overall structure, set up for well-formedness checking. The initial stages of the parser consist entirely of parallel bit stream operations. Of these, the Classification function consists of XML character class definitions that are generated using our character class compiler \textit{ccc}, while much of the U8\_Validation similarly consists of UTF-8 byte class definitions that are also generated by ccc. The remainder of these functions are programmed manually.
• docs/HPCA2012/04-methodology.tex

 r1378 \paragraph{XML Parsers}\label{parsers} We evaluate the Parabix XML parser described above against two widely available open-source parsers, Xerces-C++ and Expat. Each of the parsers is evaluated on the task of implementing the parsing and well-formedness checking requirements of the full XML 1.0 specification \cite{TR:XML}. Xerces-C++ version 3.1.1 (SAX) \cite{xerces} is a validating open-source XML parser written in C++, available as part of the Apache project. To ensure a fair comparison, we use the WFXML scanner of Xerces to eliminate the overheads of validation and also use the SAX interface to avoid the overheads of DOM tree construction. Expat version 2.0.1 \cite{expat} is a non-validating XML parser library written in C, representative of traditional recursive descent XML parser implementations. We use a mixture of document-oriented and data-oriented XML files in our study to analyze workloads with a full spectrum of markup densities. Table \ref{XMLDocChars} shows the document characteristics of the XML workloads; two of the files use UTF-8 encodings of Japanese and German characters respectively, while the remaining data files are data-oriented XML documents that consist entirely of single-byte encoded ASCII characters. \begin{table*}
• docs/HPCA2012/06-scalability.tex

 r1389 \label{relative_performance_intel} } \caption{Comparing Parabix on ARM and Intel.} \end{figure}
• docs/HPCA2012/main.tex

 r1380 %{$\{$ashriram,sandhya,kshen$\}$@cs.rochester.edu} %} \date{} \def\iccvPaperID{160} \begin{document} \maketitle