# Changeset 1348

Timestamp:
Aug 23, 2011, 1:02:30 AM (8 years ago)
Message:

new abstract for new intro

Location:
docs/HPCA2012
Files:
4 edited

• ## docs/HPCA2012/00-abstract.tex

 r1302
 Removed:
 XML is a set of rules for the encoding of documents in machine-readable form. The simplicity and generality of these rules make XML widely used in web services and database systems. Traditional XML parsers are built around a byte-at-a-time processing model in which each character token of an XML document is examined in sequence. Unfortunately, the byte-at-a-time sequential model is a performance barrier in more demanding applications, is energy-inefficient, and makes poor use of the wide SIMD registers and other parallelism features of modern processors. This paper assesses the energy and performance of a new approach to XML parsing, based on parallel bit stream technology, as implemented in successive software generations of the Parabix XML parser. In Parabix, we first convert character streams into sets of parallel bit streams. We then exploit the SIMD operations prevalent on commodity-level hardware for performance. The first-generation Parabix1 parser exploits the processor's built-in $bitscan$ instructions over these streams to make multibyte moves but follows an otherwise sequential approach. The second-generation Parabix2 technology adds further parallelism by replacing much of the sequential bit scanning with a parallel scanning approach based on bit stream addition. We evaluate Parabix1 and Parabix2 against two widely used XML parsers, James Clark's Expat and Apache's Xerces, across three generations of x86 machines, including the new Intel \SB{}. We show that Parabix2's speedup is 2$\times$--7$\times$ over Expat and Xerces. In stark contrast to the energy expenditures necessary to realize performance gains through multicore parallelism, we also show that our Parabix parsers deliver energy savings in direct proportion to the gains in performance. In addition, we assess the scalability advantages of SIMD processor improvements across Intel processor generations, culminating with an evaluation of the 256-bit AVX technology in \SB{} versus the now legacy 128-bit SSE technology.
 Added:
 In modern applications, text files are employed widely. For example, XML files provide data storage in human-readable form and are widely used in web services, database systems, and mobile phone SDKs. Traditional text processing tools are built around a byte-at-a-time processing model in which each character token of a document is examined in sequence. The byte-at-a-time model is highly challenging for commodity processors: it includes many unpredictable, input-dependent branches, which cause pipeline squashes and stalls. Furthermore, typical text processing tools perform few operations per processed character and experience high cache miss rates while parsing a file. Overall, parsing text in important domains such as XML processing requires high performance, and hardware designers have adopted customized hardware and ASIC solutions. % In this paper on commodity. % We expose through a toolchain. % We demonstrate what can be achieved with branches etc. % We study various tradeoffs. % Finally we show the benefits can be stacked
 In this paper we enable text processing applications to use commodity processors effectively. We introduce Parabix (Parallel Bitstream) technology, a software runtime and execution model that allows applications to exploit modern SIMD instruction extensions for high-performance text processing. Parabix enables the application developer to write constructs assuming unlimited SIMD data parallelism; our runtime translator generates code based on machine specifics (e.g., SIMD register widths) to realize the programmer's specifications. The key insight into efficient text processing in Parabix is the data organization: Parabix transposes a sequence of 8-bit characters into sets of 8 parallel bit streams, which then lets us operate on multiple characters with single bit-parallel SIMD operators. We demonstrate the features and efficiency of Parabix with an XML parsing application.
 We evaluate the Parabix-based XML parser against two widely used XML parsers, Expat and Apache's Xerces, across three generations of x86 processors, including the new Intel \SB{}. We show that Parabix's speedup is 2$\times$--7$\times$ over Expat and Xerces. We observe that Parabix makes efficient use of intra-core parallel hardware on commodity processors and supports significant gains in energy efficiency. Using Parabix, we assess the scalability advantages of SIMD processor improvements across Intel processor generations, culminating with a look at the latest 256-bit AVX technology in \SB{} versus the now legacy 128-bit SSE technology. As part of this study we also preview the Neon extensions on ARM processors. Finally, we partition the XML program into pipeline stages and demonstrate that thread-level parallelism exploits the SIMD units scattered across the different cores and improves performance (2$\times$ on 4 cores) at the same energy level as the single-threaded version.
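The transposition idea in the new abstract can be sketched in plain Python. This is a scalar model of what Parabix does with SIMD registers, not the Parabix implementation: the function names are illustrative, and arbitrarily wide Python integers stand in for SIMD bit-stream registers. Bit stream $k$ holds bit $k$ of every input byte, so one bitwise expression tests every character position at once.

```python
# Illustrative sketch (not the Parabix API): transpose a byte stream into
# 8 parallel bit streams, then find all '<' characters with bitwise logic
# instead of a byte-at-a-time scan.

def transpose(data: bytes):
    """Return 8 integers; bit i of stream k equals bit k of byte data[i]."""
    streams = [0] * 8
    for i, byte in enumerate(data):
        for k in range(8):
            if byte & (1 << k):
                streams[k] |= 1 << i
    return streams

def match_byte(streams, value, n):
    """Bit-parallel match of one byte value: one result bit per position."""
    ones = (1 << n) - 1              # mask covering all n input positions
    result = ones
    for k in range(8):
        if (value >> k) & 1:
            result &= streams[k]     # this bit of the value must be 1
        else:
            result &= streams[k] ^ ones  # this bit must be 0
    return result

text = b"<a><b/></a>"
marks = match_byte(transpose(text), ord("<"), len(text))
# A sequential parser would now bitscan `marks`; here we just list the hits.
positions = [i for i in range(len(text)) if (marks >> i) & 1]
```

On real hardware the inner loops disappear: the transposition is done with SIMD pack/unpack instructions and the match is a handful of register-wide AND/XOR operations, which is where the branch-free speedup comes from.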
• ## docs/HPCA2012/01-intro.tex

 r1342 \section{Introduction} We have now long since reached the limit of classical Dennard voltage scaling, which enabled us to keep all of the transistors afforded by Moore's law active. This has already resulted in a rethink of the way general-purpose processors are built: processor frequencies have remained stagnant over the last five years, with the capability to boost core speeds on Intel multicores only if other cores on the chip are shut off. Chip makers strive to achieve energy-efficient computing by operating at more optimal core frequencies, and they aim to increase performance with larger numbers of cores. Unfortunately, given the limited levels of parallelism that can be found in applications~\cite{blake-isca-2010}, it is not certain how many cores can be productively used in scaling our chips~\cite{esmaeilzadeh-isca-2011}. This is because exploiting parallelism across multiple cores tends to require heavyweight threads that are difficult to manage and synchronize.
 In this paper, we tackle the infamous ``thirteenth dwarf'' (parsers/finite state machines), which is considered the hardest application class to parallelize~\cite{Asanovic:EECS-2006-183}, and show how Parabix, a novel software architecture, tool chain, and run-time environment, can indeed be used to dramatically improve parsing efficiency on commodity processors.
 Based on the concept of transposing byte-oriented character data into parallel bit streams for the individual bits of each byte, the Parabix framework exploits the SIMD extensions (SSE/AVX on x86, Neon on ARM) of commodity processors to process hundreds of character positions in an input stream simultaneously. To achieve transposition, Parabix exploits sophisticated SIMD instructions that enable data elements to be packed and unpacked from registers in a regular manner, which improves the overall cache access behavior of the application, resulting in significantly fewer misses and better utilization. Parabix also dramatically reduces branches in parsing code, resulting in a more efficient pipeline, and substantially improves register/cache utilization, which minimizes the energy wasted on data transfers. We apply Parabix technology to the problem of XML parsing and develop several implementations for different computing platforms. XML is a particularly interesting application: it is a standard of the web consortium that provides a common framework for encoding and communicating data. XML provides critical data storage for applications ranging from Office Open XML in Microsoft Office, to NDFD XML, to Castor XML in the Martian Rovers, to XML data in Android phones. XML parsing efficiency is important for multiple application areas: in server workloads the key focus is on overall transactions per second, while in network switches and cell phones, latency and the energy cost of parsing are of paramount importance. Traditional software-based XML parsers have many inefficiencies due to complex input-dependent branching structures, which lead to considerable branch misprediction penalties, as well as poor use of memory bandwidth and data caches due to byte-at-a-time processing and multiple buffering. XML ASIC chips have been around for over 6 years, but they typically lag behind CPUs in technology due to cost constraints. Our focus is on how much we can improve the performance of the XML parser on commodity processors with Parabix technology. In the end, as summarized by
• ## docs/HPCA2012/09-pipeline.tex

 r1347 Moreover, using multiple cores, we can further improve the performance of Parabix while keeping energy consumption at the same level. A typical approach to parallelizing software, data parallelism, requires nearly independent data. However, the nature of XML files makes them hard to partition nicely for data parallelism. Several approaches have been used to address this problem. A preparsing phase has been proposed to help partition the XML document \cite{dataparallel}. The goal of this preparsing is to determine the tree structure of the XML document so that it can be used to guide the full parsing in the next phase. Another data-parallel algorithm is ParDOM \cite{Shah:2009}. It first builds partial DOM node tree structures for each data segment and then links them, using preorder numbers that have been assigned to each start element to determine the ordering among siblings, together with a stack to manage the parent-child relationships between elements. Data parallelism approaches introduce considerable overhead to resolve the data dependencies between segments.
 Therefore, instead of partitioning the data into segments and assigning different data segments to different cores, we propose a pipeline parallelism strategy that partitions the process into several stages and lets each core work on a single stage. The interface between stages is implemented using a circular array, where each entry consists of all ten data structures for one segment, as listed in Table \ref{pass_structure}. Each thread keeps an index into the array ($I_N$), which is compared with the index ($I_{N-1}$) kept by the previous thread before processing a segment. If $I_N$ is smaller than $I_{N-1}$, thread $N$ can start processing segment $I_N$; otherwise, the thread keeps reading $I_{N-1}$ until $I_{N-1}$ is larger than $I_N$. The time consumed by continuously loading the value of $I_{N-1}$ and comparing it with $I_N$ will later be referred to as stall time.
 When a thread finishes processing a segment, it increases its index by one. \begin{table*}[t]
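The stall-and-advance protocol between stages can be illustrated with a small two-stage sketch. All names here are hypothetical: Python threads stand in for the per-core pipeline stages, a list of integers stands in for the circular array of segment structures, and squaring stands in for a stage's real parsing work. Stage $N$ spins (the "stall time") while its index $I_N$ is not behind $I_{N-1}$.

```python
import threading

SEGMENTS = 8          # number of input segments to process
RING = 4              # capacity of the circular array between stages
ring = [None] * RING  # each slot would hold one segment's data structures
idx = [0, 0]          # idx[n] is I_N: next segment index for stage n
results = []

def stage1():
    for i in range(SEGMENTS):
        while i - idx[1] >= RING:    # back-pressure: never lap the next stage
            pass
        ring[i % RING] = i * i       # stand-in for real per-segment work
        idx[0] = i + 1               # publish progress to the next stage

def stage2():
    for i in range(SEGMENTS):
        while i >= idx[0]:           # stall time: wait until I_N < I_{N-1}
            pass
        results.append(ring[i % RING])
        idx[1] = i + 1

threads = [threading.Thread(target=stage1), threading.Thread(target=stage2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# results now holds every segment's output, in order
```

Under CPython's GIL the plain index reads and writes are safe as written; a C implementation of the same scheme would need atomic loads/stores (or appropriate memory fences) on the shared indices.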
• ## docs/HPCA2012/main.tex

 r1343 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % ACM title header format \title{\vspace{-30pt} Boosting the Efficiency of Text Processing on Commodity Processors: The Parabix Story % % \thanks{%