Aug 23, 2011, 11:42:04 AM (8 years ago)

New conclusion

1 edited


  • docs/HPCA2012/10-conclusions.tex

    r1339 r1350  
    3 This paper has examined energy efficiency and performance
    4 characteristics of four XML parsers considered over three
    5 generations of Intel processor architecture and shown that
    6 parsers based on parallel bit stream technology have dramatically
    7 better performance, energy efficiency and scalability than
    8 traditional byte-at-a-time parsers widely deployed in current
    9 software.  Based on a novel application of the short vector
    10 SIMD technology commonly found in commodity processors of
    11 all kinds, parallel bit stream technology scales well with
    12 improvements in processor SIMD capabilities.  With the recent
    13 introduction of the first generation of Intel processors that
    14 incorporate AVX technology, the change to 3-operand
    15 form SIMD operations has delivered a substantial benefit
    16 for the Parabix2 parsers simply through recompilation.
    17 Restructuring of Parabix2 to take advantage of the 256-bit SIMD
    18 capabilities also delivered a substantial reduction in
    19 instruction count, but without corresponding performance
    20 benefits in the first generation of AVX implementations.
     3% In this paper we presented a framework.
     4% We demonstrated on XML.
     5% We showed benefits
     6% We analyzed SIMD
     7% We stacked multithreading
     8% We have released it.
     10% Future research
    23 There are many directions for further research. These
    24 include compiler and tools technology to automate the low-level
    25 programming tasks inherent in building parallel bit stream
    26 applications, widening the research by applying the techniques
    27 to other forms of text analysis and parsing, and further
    28 investigation of the interaction between parallel bit
    29 stream technology and processor architecture.  Two promising
    30 avenues include investigation of GPGPU approaches to parallel
    31 bit stream technology and the leveraging of the intraregister parallelism
    32 inherent in this approach to also take advantage of the intrachip
    33 parallelism of multicore processors.
     12 In this paper we presented Parabix, a software runtime framework for
     13 exploiting SIMD data units found on commodity processors for text
     14 processing.  The Parabix framework allows programmers to focus on exposing the
     15 parallelism in their applications, assuming an infinite-resource
     16 abstract SIMD machine, without worrying about or having to change code
     17 to handle processor specifics (e.g., 128-bit SIMD on SSE vs 256-bit SIMD
     18 on AVX). We applied Parabix technology to a widely deployed
     19 application, XML parsing, and demonstrated the efficiency gains that can
     20 be obtained on commodity processors. Compared to the conventional XML
     21 parsers, Expat and Xerces, we achieve a 2$\times$--7$\times$
     22 improvement in performance and an average x$\times$ improvement in
     23 energy. We achieve high compute efficiency with an overall ?$\times$
     24 reduction in branches, a ?$\times$ reduction in branch mispredictions,
     25 a ?$\times$ reduction in LLC misses, and an increase in data parallelism,
     26 processing up to 128 characters with a single operation. We used the
     27 Parabix framework and XML parsers to study the features of the new 256-bit
     28 AVX extension in Intel processors. We find that while the move to
     29 3-operand instructions delivers significant benefit, the wider
     30 operations in some cases have higher overheads compared to the
     31 existing 128-bit SSE operations. We also compare Intel's SIMD
     32 extensions against ARM NEON. Note that Parabix allowed us to
     33 perform these studies without having to change the application source.
     34 Finally, we parallelized the Parabix XML parser to take advantage of
     35 the SIMD units in every core on the chip. We demonstrate that the
     36 benefits of thread-level parallelism are complementary to the
     37 fine-grained parallelism we exploit; parallelized Parabix achieves a
     38 further 2$\times$ improvement in performance.