Changeset 1350 for docs

Timestamp:
Aug 23, 2011, 11:42:04 AM
Message:

New conclusion

Location:
docs/HPCA2012
Files:
6 edited

• docs/HPCA2012/00-abstract.tex

 r1349 In modern applications text files are employed widely. For example, XML files provide data storage in human readable format and are widely used in web services, database systems, and mobile phone SDKs. Traditional text processing tools are built around a byte-at-a-time processing model where each character token of a document is examined. The byte-at-a-time model is highly challenging for commodity processors. It includes many unpredictable input-dependent branches which cause pipeline squashes and stalls. Furthermore, typical text processing tools perform few operations per processed character and experience high cache miss rates when parsing the file. Overall, parsing text in important domains like XML processing requires high performance, motivating hardware designers to adopt customized hardware and ASIC solutions. used in applications ranging from database systems to mobile phone SDKs.  Traditional text processing tools are built around a byte-at-a-time processing model where each character token of a document is examined. The byte-at-a-time model is highly challenging for commodity processors. It includes many unpredictable input-dependent branches which cause pipeline squashes and stalls. Furthermore, typical text processing tools perform few operations per processed character and experience high cache miss rates. Overall, parsing text in important domains like XML processing requires high performance, motivating hardware designers to adopt ASIC solutions. % In this paper on commodity. In this paper, we enable text processing applications to effectively use commodity processors. We introduce Parabix (Parallel Bitstream) technology, a software runtime and execution model that allows applications to exploit modern SIMD instruction extensions for high performance text processing. Parabix enables the application developer to write constructs assuming unlimited SIMD data parallelism. 
Our runtime translator generates code based on machine specifics (e.g., SIMD register widths) to realize the programmer specifications.  The key insight into efficient text processing in Parabix is the data organization. It transposes the sequence of 8-bit characters into sets of 8 parallel bit streams which then enables us to operate on multiple characters with single bit-parallel SIMD operators. We demonstrate the features and efficiency of Parabix with an XML parsing application. We evaluate a Parabix-based XML parser against two widely used XML parsers, Expat and Apache's Xerces, and across three generations of x86 processors, including the new Intel \SB{}.  We show that Parabix's speedup is 2$\times$--7$\times$ over Expat and Xerces. We observe that Parabix overall makes efficient use of intra-core parallel hardware on commodity processors and supports significant gains in energy. Using Parabix, we assess the scalability advantages of SIMD processor improvements across Intel processor generations, culminating with a look at the latest 256-bit AVX technology in \SB{} versus the now-legacy 128-bit SSE technology. As part of this study we also preview the Neon extensions on ARM processors. Finally, we partition the XML program into pipeline stages and demonstrate that thread-level parallelism exploits SIMD units scattered across the different cores and improves performance (2$\times$ on 4 cores) at the same energy levels as the single-thread version. technology, a software runtime and execution model that allows applications to exploit modern SIMD instruction extensions for high performance text processing. Parabix enables the application developer to write constructs assuming unlimited SIMD data parallelism and Parabix's runtime translator generates code based on machine specifics (e.g., SIMD register widths).  The key insight into efficient text processing in Parabix is the data organization. 
Parabix transposes the sequence of character bytes into sets of 8 parallel bit streams which then enables us to operate on multiple characters with single bit-parallel SIMD operators. We demonstrate the features and efficiency of Parabix with an XML parsing application. We evaluate a Parabix-based XML parser against two widely used XML parsers, Expat and Apache's Xerces, and across three generations of x86 processors, including the new Intel \SB{}.  We show that Parabix's speedup is 2$\times$--7$\times$ over Expat and Xerces. We observe that Parabix overall makes efficient use of intra-core parallel hardware on commodity processors and supports significant gains in energy. Using Parabix, we assess the scalability advantages of SIMD processor improvements across Intel processor generations, culminating with a look at the latest 256-bit AVX technology in \SB{} versus the now-legacy 128-bit SSE technology. Finally, we partition the XML program into pipeline stages and demonstrate that thread-level parallelism exploits SIMD units scattered across the different cores and improves performance (2$\times$ on 4 cores) at the same energy levels as the single-thread version.
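The transposition described in the abstract can be illustrated with a minimal sketch. This is not the Parabix implementation: the function names `transpose_to_bitstreams` and `match_byte` are hypothetical, and arbitrary-precision Python integers stand in for SIMD registers, so the "unlimited width" here is an assumption of the illustration, not a property of any real hardware. The sketch only shows why eight parallel bit streams let one sequence of bitwise operations classify every character position at once.

```python
def transpose_to_bitstreams(data: bytes) -> list[int]:
    """Split a byte sequence into 8 parallel bit streams.

    Stream k collects bit k of every byte (k = 0 is the most significant bit).
    Bit position i of each stream corresponds to data[i].
    """
    streams = [0] * 8
    for i, byte in enumerate(data):
        for k in range(8):
            if (byte >> (7 - k)) & 1:
                streams[k] |= 1 << i
    return streams


def match_byte(streams: list[int], length: int, target: int) -> int:
    """Return a bit stream whose bit i is set iff data[i] == target.

    Uses only bitwise AND/NOT over whole streams, so all positions
    are tested together rather than one byte at a time.
    """
    mask = (1 << length) - 1  # keep results within the input length
    result = mask
    for k in range(8):
        wanted = (target >> (7 - k)) & 1
        result &= streams[k] if wanted else (~streams[k] & mask)
    return result


data = b"<a><b/></a>"
streams = transpose_to_bitstreams(data)
marks = match_byte(streams, len(data), ord("<"))
positions = [i for i in range(len(data)) if (marks >> i) & 1]
```

Here `positions` lists every index holding `<`. In a real SIMD setting each stream would be processed in register-sized blocks (for example 128 bits for SSE), which is the machine-specific detail the abstract says the runtime translator handles.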
• docs/HPCA2012/10-conclusions.tex

 r1339 \section{Conclusion} \label{section:conclusion} This paper has examined energy efficiency and performance characteristics of four XML parsers considered over three generations of Intel processor architecture and shown that parsers based on parallel bit stream technology have dramatically better performance, energy efficiency and scalability than traditional byte-at-a-time parsers widely deployed in current software.  Based on a novel application of the short vector SIMD technology commonly found in commodity processors of all kinds, parallel bit stream technology scales well with improvements in processor SIMD capabilities.  With the recent introduction of the first generation of Intel processors that incorporate AVX technology, the change to 3-operand form SIMD operations has delivered a substantial benefit for the Parabix2 parsers simply through recompilation. Restructuring of Parabix2 to take advantage of the 256-bit SIMD capabilities also delivered a substantial reduction in instruction count, but without corresponding performance benefits in the first generation of AVX implementations. % In this paper we presented a framework. % We demonstrated on XML. % We showed benefits % We analyzed SIMD % We stacked multithreading % We have released it. % Future research There are many directions for further research. These include compiler and tools technology to automate the low-level programming tasks inherent in building parallel bit stream applications, widening the research by applying the techniques to other forms of text analysis and parsing, and further investigation of the interaction between parallel bit stream technology and processor architecture.  Two promising avenues include investigation of GPGPU approaches to parallel bit stream technology and the leveraging of the intraregister parallelism inherent in this approach to also take advantage of the intrachip parallelism of multicore processors. 
In this paper we presented Parabix, a software runtime framework for exploiting the SIMD data units found on commodity processors for text processing.  The Parabix framework allows developers to focus on exposing the parallelism in their applications, assuming an infinite-resource abstract SIMD machine, without worrying about or having to change code to handle processor specifics (e.g., 128-bit SIMD on SSE vs. 256-bit SIMD on AVX). We applied Parabix technology to a widely deployed application, XML parsing, and demonstrated the efficiency gains that can be obtained on commodity processors. Compared to the conventional XML parsers, Expat and Xerces, we achieve a 2$\times$--7$\times$ improvement in performance and an average x$\times$ improvement in energy. We achieve high compute efficiency with an overall ?$\times$ reduction in branches, ?$\times$ reduction in branch mispredictions, ?$\times$ reduction in LLC misses, and an increase in data parallelism, processing up to 128 characters with a single operation. We used the Parabix framework and XML parsers to study the features of the new 256-bit AVX extension in Intel processors. We find that while the move to 3-operand instructions delivers significant benefit, the wider operations in some cases have higher overheads compared to the existing 128-bit SSE operations. We also compare Intel's SIMD extensions against ARM Neon. Note that Parabix allowed us to perform these studies without having to change the application source. Finally, we parallelized the Parabix XML parser to take advantage of the SIMD units in every core on the chip. We demonstrate that the benefits of thread-level parallelism are complementary to the fine-grain parallelism we exploit; parallelized Parabix achieves a further 2$\times$ improvement in performance.
• docs/HPCA2012/latex/iccv.sty

 r1335 \begin{tabular}[t]{c} %     \ificcvfinal\@author\else Anonymous HPCA submission\\ \vspace*{1pt}\\%This space will need to be here in the final copy, so don't squeeze it out for the review copy. %    \vspace*{1pt}\\%This space will need to be here in the final copy, so don't squeeze it out for the review copy. Paper ID \iccvPaperID %\fi \end{tabular} {% \centerline{\large\bf Abstract}% \vspace*{12pt}% % \vspace*{12pt}% \it% }
• docs/HPCA2012/main.tex

 r1348 \renewcommand\section{\@startsection {section}{1}{0pt}% {0.2\baselineskip}% {0.1\baselineskip}% {0.1\baselineskip}% {\normalfont\Large\bfseries\raggedright}% \maketitle \begin{abstract} %\centerline{\large\textbf{Abstract}} \begin{singlespace} %\bigskip %\begin{quotation} \emph{\input{00-abstract.tex}} %\end{quotation} \end{singlespace} \end{abstract} \input{09-pipeline.tex} \input{10-conclusions.tex} % tighten spacing: \let\oldthebibliography\thebibliography
• docs/HPCA2012/preamble-submit.tex

 r1335 \marginparsep 0in \marginparwidth 0in \topmargin -0.2in \topmargin -0.1in %\headheight 0in %\headsep 0in