Changeset 1393


Timestamp:
Aug 30, 2011, 10:47:59 AM
Author:
ashriram
Message:

Minor bug fixes up to 04

Location:
docs/HPCA2012
Files:
9 edited

  • docs/HPCA2012/00-abstract.tex

    r1360 r1393  
    11In modern applications text files are employed widely. For example,
    22XML files provide data storage in human readable format and are
    3 ubiquitous in applications ranging from database systems to mobile phone
    4 SDKs.  Traditional text processing tools are built around a
     3ubiquitous in applications ranging from database systems to mobile
     4phone SDKs.  Traditional text processing tools are built around a
    55byte-at-a-time processing model where each character token of a
    66document is examined. The byte-at-a-time model is highly challenging
     
    88input-dependent branches which cause pipeline squashes and
    99stalls. Furthermore, typical text processing tools perform few
    10 operations per processed character and experience high cache miss
     10operations per character and experience high cache miss
    1111rates. Overall, parsing text in important domains like XML processing
    12 requires high performance motivating hardware designers to adopt ASIC
     12requires high performance, motivating the adoption of custom hardware
    1313solutions.
    1414
     
    2121In this paper, we enable text processing applications to effectively
    2222use commodity processors. We introduce Parabix (Parallel Bitstream)
    23 technology, a software toolkit and execution framework that allows
    24 applications to exploit modern SIMD instructions extensions for high
    25 performance text processing. Parabix enables the application developer
    26 to write constructs assuming unlimited SIMD data parallelism and
    27 Parabix's bitstream translator generates code based on machine specifics
    28 (e.g., SIMD register widths).  The key insight into efficient text
    29 processing in Parabix is the data organization. Parabix transposes the
    30 sequence of character bytes into sets of 8 parallel bit streams which
    31 then enables us to operate on multiple characters with single
    32 bit-parallel SIMD operators. We demonstrate the features and
    33 efficiency of Parabix with a XML parsing application. We evaluate a
    34 Parabix-based XML parser against two widely used XML parsers, Expat
    35 and Apache's Xerces, and across three generations of x86 processors,
    36 including the new Intel \SB{}.  We show that Parabix's speedup is
    37 2$\times$--7$\times$ over Expat and Xerces. We observe that Parabix
    38 overall makes efficient use of intra-core parallel hardware on
    39 commodity processors and supports significant gains in energy. Using
    40 Parabix, we assess the scalability advantages of SIMD processor
    41 improvements across Intel processor generations, culminating with a
    42 look at the latest 256-bit AVX technology in \SB{} versus the now
    43 legacy 128-bit SSE technology. We also examine Parabix on mobile platforms
    44 using ARM processors with Neon SIMD extensions.  Finally, we partition the XML
    45 program into pipeline stages and demonstrate that thread-level
    46 parallelism exploits SIMD units scattered across the different cores
    47 and improves performance (2$\times$ on 4 cores) at same energy levels
    48 as the single-thread version.
     23technology, a software toolchain and execution framework that allows
     24applications to exploit modern SIMD instructions for high performance
     25text processing. Parabix enables the application developer to write
     26constructs assuming unlimited SIMD data parallelism and Parabix's
     27bitstream translator generates code based on machine specifics (e.g.,
     28SIMD register widths).  The key insight into efficient text processing
     29in Parabix is the data organization. Parabix transposes the sequence
     30of character bytes into sets of 8 parallel bit streams which then
     31enables us to operate on multiple characters with bit-parallel SIMD
     32operations. We demonstrate the features and efficiency of Parabix with
     33an XML parsing application. We evaluate the Parabix-based parser
     34against two widely used XML parsers, Expat and Apache's
     35Xerces. Parabix makes efficient use of intra-core SIMD hardware and
     36demonstrates 2$\times$--7$\times$ speedup and 4$\times$ improvement in
     37energy efficiency compared to the conventional parsers. We assess the
     38scalability of SIMD implementations across three generations of x86
     39processors including the new \SB{}. We compare the 256-bit AVX
     40technology in Intel \SB{} versus the now legacy 128-bit SSE technology
     41and analyze the benefits and challenges of using the AVX
     42extensions.  Finally, we partition the XML program into pipeline stages
     43and demonstrate that thread-level parallelism enables the application
     44to exploit SIMD units scattered across the different cores and
     45improves performance (2$\times$ on 4 cores) at the same energy
     46levels as the single-thread version of the XML application.
    4947
    5048
  • docs/HPCA2012/01-intro.tex

    r1373 r1393  
    11\section{Introduction}
    22We have now long since reached the limit of classical Dennard voltage
    3 scaling that enabled us to keep all of transistors afforded by
    4 Moore's law active. This has already resulted in a rethink of the way
     3scaling that enabled us to keep all of the transistors afforded by Moore's
     4law active. This has already resulted in a rethink of the way
    55general-purpose processors are built: processor frequencies have
    66remained stagnant over the last 5 years with the capability to boost
    7 core speeds on Intel multicores only if other cores on the chip are
    8 shut off. Chip makers strive to achieve energy efficient computing by
    9 operating at more optimal core frequencies and aim to increase
    10 performance with a larger number of cores. Unfortunately, given the
    11 limited levels of parallelism that can be found in
    12 applications~\cite{blake-isca-2010}, it is not certain how many cores
    13 can be productively used in scaling our
    14 chips~\cite{esmaeilzadeh-isca-2011}. This is because exploiting
    15 parallelism across multiple cores tends to require heavy weight threads
    16 that are difficult to manage and synchronize.
     7a core's frequency only if other cores on the chip are shut off. Chip makers
     8strive to achieve energy-efficient computing by operating at more
     9efficient core frequencies and aim to increase performance with a larger
     10number of cores. Unfortunately, given the limited levels of
     11parallelism that can be found in applications~\cite{blake-isca-2010},
     12it is not certain how many cores can be productively used in scaling
     13our chips~\cite{esmaeilzadeh-isca-2011}. This is because exploiting
     14parallelism across multiple cores tends to require heavyweight
     15threads that are difficult to manage and synchronize.
    1716
    1817The desire to improve the overall efficiency of computing is pushing
     
    3332In this paper, we tackle the infamous ``thirteenth dwarf''
    3433(parsers/finite state machines) that is considered to be the hardest
    35 application class to parallelize~\cite{Asanovic:EECS-2006-183} and
    36 show how Parabix, a novel software architecture, tool chain and
    37 run-time environment can indeed be used to dramatically improve
    38 parsing efficiency on commodity processors.  Based on the concept of
    39 transposing byte-oriented character data into parallel bit streams for
    40 the individual bits of each byte, the Parabix framework exploits the
    41 SIMD extensions on commodity processors (SSE/AVX on x86, Neon on ARM) 
     34application class to parallelize~\cite{Asanovic:EECS-2006-183}. We
     35present Parabix, a novel execution framework and software run-time
     36environment that can be used to dramatically improve the efficiency of
     37text processing and parsing on commodity processors.  Parabix
     38transposes byte-oriented character data into parallel bit streams for
     39the individual bits of each character byte and then exploits the
     40SIMD extensions on commodity processors (SSE/AVX on x86, Neon on ARM)
    4241to process hundreds of character positions in an input stream
    4342simultaneously.  To achieve transposition, Parabix exploits
     
    5756applications ranging from Office Open XML in Microsoft Office to NDFD
    5857XML of the NOAA National Weather Service, from KML in Google Earth to
    59 Castor XML in the Martian Rovers, as well as ubiquitous XML data in Android phones.  XML
    60 parsing efficiency is important for multiple application areas; in
    61 server workloads the key focus in on overall transactions per second,
    62 while in applications in network switches and cell phones, latency
    63 and energy are of paramount importance.  Traditional
    64 software-based XML parsers have many inefficiencies including
    65 considerable branch misprediction penalties due to complex
    66 input-dependent branching structures as well as poor use of memory bandwidth and
    67 data caches due to byte-at-a-time processing and multiple buffering.
    68 XML ASIC chips have been around for over 6 years, but typically lag
    69 behind CPUs in technology due to cost constraints. Our focus is how
    70 much we can improve performance of the XML parser on commodity
    71 processors with Parabix technology.
     58Castor XML in the Martian Rovers.  XML parsing efficiency is important
     59for multiple application areas; in server workloads the key focus is
     60on overall transactions per second, while in applications in network
     61switches and cell phones, latency and energy are of paramount
     62importance.  Traditional software-based XML parsers have many
     63inefficiencies including considerable branch misprediction penalties
     64due to complex input-dependent branching structures as well as poor
     65use of memory bandwidth and data caches due to byte-at-a-time
     66processing and multiple buffering.  XML ASIC chips have been around
     67for over 6 years, but typically lag behind CPUs in technology due to
     68cost constraints. Our focus is how much we can improve performance of
     69the XML parser on commodity processors with Parabix technology.
    7270
    7371In the end, as summarized by
     
    9290parsing.
    9391
    94 2) We compare  Parabix XML parsers against conventional parsers
    95 and assess the improvement in overall performance and energy efficiency
    96 on each platform.   We are the first to compare and contrast SSE/AVX extensions across
    97 multiple generation of Intel processors and show that there are
    98 performance challenges when using newer generation SIMD extensions,
    99 possibly due to their memory interface. We compare the ARM Neon
    100 extensions against the x86 SIMD extensions and comment on the latency of
    101 SIMD operations across these architectures.
     922) We compare Parabix XML parsers against conventional parsers and
     93assess the improvement in overall performance and energy efficiency on
     94each platform.  We are the first to compare and contrast SSE/AVX
     95extensions across multiple generations of Intel processors and show
     96that there are performance challenges when using newer generation SIMD
     97extensions. We compare the ARM Neon extensions against the x86 SIMD
     98extensions and comment on the latency of SIMD operations across these
     99architectures.
    102100
    1031013) Finally, building on the SIMD parallelism of Parabix technology,
     
    154152
    155153The remainder of this paper is organized as follows.
    156 Section~\ref{section:background} presents background material on XML parsing
    157 and provides insight into the inefficiency of traditional parsers on
    158 mainstream processors.  Section~\ref{section:parabix} describes the
    159 Parabix architecture, tool chain and run-time environment.
    160 Section~\ref{section:parser} describes the application of the
    161 Parabix framework to the construction of an XML parser
    162 enforcing all the well-formedness rules of the XML
    163 specification.  Section~\ref{section:methodology} then describes
    164 the overall methodology of our performance and energy study.
    165 Section~\ref{section:baseline} presents a detailed
    166 performance evaluation on a \CITHREE\ processor as
    167 our primary evaluation platform, addressing a number of
    168 microarchitectural issues including cache misses, branch
    169 mispredictions, and SIMD instruction counts.  Section~\ref{section:scalability} examines
    170 scalability and performance gains through three generations of Intel
    171 architecture.  Section~\ref{section:avx} examines the extension
    172 of the Parabix technology to take advantage of Intel's new
    173 256-bit AVX technology, while Section~\ref{section:neon} investigates
    174 the applications of this technology on mobile platforms using
    175 ARM processors with Neon SIMD extensions.
    176 Section~\ref{section:multithread} then looks at the multithreading of the
    177 Parabix XML parser using pipeline parallelism.
    178 Section~\ref{section:related} discusses related work, after which
    179 Section~\ref{section:conclusion} concludes the paper.
     154Section~\ref{section:background} presents background material on XML
     155parsing and provides insight into the inefficiency of traditional
     156parsers on mainstream processors.  Section~\ref{section:parabix}
     157describes the Parabix architecture, tool chain and run-time
     158environment.  Section~\ref{section:parser} describes the application
     159of the Parabix framework to the construction of an XML parser
     160enforcing all the well-formedness rules of the XML specification.
     161Section~\ref{section:baseline} presents a detailed performance
     162analysis of Parabix on a \CITHREE\ system using hardware performance
     163counters and compares it against conventional parsers.
     164Section~\ref{section:scalability} compares the performance and energy
     165efficiency of 128 bit SIMD extensions across three generations of
     166intel processors and includes a comparison with the ARM Cortex-A8
     167processor.  Section~\ref{section:avx} examines the Intel's new 256-bit
     168AVX technology and comments on the benefits and challenges compared to
     169the 128-bit SSE instructions.  Finally,
     170Section~\ref{section:multithread} looks at the multithreading of the
     171Parabix XML parser, which seeks to exploit the SIMD units scattered
     172across multiple cores.
    180173
    181174
  • docs/HPCA2012/02-background.tex

    r1374 r1393  
    4343XML can represent virtually any type of information (i.e., content) in a descriptive fashion.
    4444XML markup encodes a description of an XML document's storage layout and logical structure.
    45 Because XML is intended to be human-readable, XML markup tags are often verbose by design \cite{TR:XML}.
     45Since XML is intended to be human-readable, markup tags are often verbose by design \cite{TR:XML}.
    4646For example, Figure \ref{fig:sample_xml} provides a standard product list encapsulated within an XML document.
    4747All content is highlighted in bold. Anything that is not content is considered markup.
     
    4949\begin{figure}[h]
    5050
    51 {\scriptsize
     51{\footnotesize
    5252\begin{center}
    5353\begin{tabbing}
     
    7171\subsection{XML Parsers}
    7272% However, textual data tends to consist of variable-length items in generally unpredictable patterns \cite{Cameron2010}.
    73 Traditional XML parsers process an XML document sequentially, a single byte-at-a-time,
    74 from the first to the last character in the source text.
    75 Each character is examined to distinguish between the XML-specific markup,
    76 such as an opening angle bracket `\verb`<`', and the content held within the document.
    77 The character that the parser is currently processing is commonly referred to its {\it cursor position}. As the
    78 parser moves its cursor through the source document, it alternates between markup scanning / validation and
    79 content processing operations.
    80 In other words,
    81 traditional XML parsers operate as finite-state machines that use byte comparisons to transition
    82 between data and metadata states. Each state transition indicates the context in which to interpret the
    83 subsequent characters. Unfortunately, textual data tends to consist of variable-length items sequenced in
    84 generally unpredictable patterns; thus any character could be a state transition until
    85 deemed otherwise.
    86 
    87 A major disadvantage of the sequential byte-at-a-time approach to XML parsing is that processing an XML document
    88 requires at least one conditional branch per byte of source text.
    89 For example, Xerces-C, which is the most popular open-source C++ based XML parser and the foundation of the
    90 Apache XML project \cite{xerces},
    91 uses a series of nested switch statements and state-dependent flag tests to control the parsing logic of the program.
    92 Our analysis, which we detail in Section \ref{section:XML-branches}, found that Xerces-C requires between 6 - 13 branches per byte
    93 of XML to support this form of control structure (depending on the ratio of markup vs. content).
    94 Cache utilization is also significantly reduced due to the manner in which markup and content must be scanned and buffered
    95 for future use.
    96 For instance, Xerces-C incurs $\sim$100 L1 cache misses per 1000 bytes of XML.
    97 % Branch mispredictions are known
    98 % to signficantly degrade XML parsing performance in proportion to the markup density of the source document
    99 % \cite{CameronHerdyLin2008} (i.e., the ratio of markup vs. content).
    100 
    101 
     73Traditional XML parsers process an XML document sequentially, a single
     74byte-at-a-time, from the first to the last character in the source
     75text.  Each character is examined to distinguish between the
     76XML-specific markup, such as a left angle bracket `\verb`<`', and the
     77content held within the document.  The character that the parser is
     78currently interpreting is commonly referred to as its {\it cursor}. As
     79the parser moves its cursor through the document, it alternates
     80between markup scanning, validation, and content processing
     81operations.  In other words, traditional XML parsers operate as
     82finite-state machines that use byte comparisons to transition between
     83data and metadata states. Each state transition indicates the context
     84for subsequent characters. Unfortunately, textual data tends to
     85consist of variable-length items sequenced in generally unpredictable
     86patterns; thus any character could be a state transition until deemed
     87otherwise.
     88
     89A major disadvantage of the sequential byte-at-a-time approach to XML
     90parsing is that processing an XML document requires at least one
     91conditional branch per byte of source text.  For example, Xerces-C,
     92which forms the foundation of the widely deployed Apache XML project
     93\cite{xerces}, uses a series of nested switch statements and
     94state-dependent flag tests to control the parsing logic of the
     95program.  Our analysis, which we detail in Section
     96\ref{section:XML-branches}, found that Xerces requires between 6 and 13
     97branches per byte of XML to support this form of control flow,
     98depending on the fraction of markup in the overall document.  Cache
     99utilization is also significantly reduced due to the manner in which
     100markup and content must be scanned and buffered for future use.  For
     101instance, Xerces incurs $\sim$100 L1 cache misses per 1000 bytes of
     102XML.  In general, while microarchitectural improvements may mitigate
     103some of these challenges (e.g., cache misses), the fundamental data
     104and control flow of such parsers is ill suited to commodity
     105processors and incurs significant overhead.
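The per-byte control flow described above is visible in even a toy scanner. The sketch below is purely illustrative (it is not the actual Xerces logic): a two-state machine that still takes at least one data-dependent branch for every input byte.

```python
def scan(text: bytes):
    """Toy byte-at-a-time FSM scanner (illustrative sketch only).

    Alternates between content and markup states, taking at least
    one data-dependent branch per input byte; real parsers such as
    Xerces use far more states and branches.
    """
    IN_CONTENT, IN_MARKUP = 0, 1
    state = IN_CONTENT
    branches = 0       # lower bound: one branch per byte examined
    markup_spans = 0
    for byte in text:
        branches += 1
        if state == IN_CONTENT:
            if byte == ord('<'):
                state = IN_MARKUP
                markup_spans += 1
        elif byte == ord('>'):
            state = IN_CONTENT
    return markup_spans, branches

spans, branches = scan(b"<a>text<b/>")
assert spans == 2 and branches == 11   # at least one branch per byte
```

Even this minimal machine cannot examine byte $i+1$ until the state reached after byte $i$ is known, which is exactly the serial dependence the bit-stream approach removes.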
    102106
    103107
  • docs/HPCA2012/03-research.tex

    r1384 r1393  
    22\label{section:parabix}
    33
    4 This section presents an overview of the SIMD-based parallel bit stream text processing framework, \emph{Parabix}.
    5 The framework has three components: a unifying architectural view of text processing in terms
    6 of parallel bit streams, a tool chain for automating the generation of parallel bit stream
    7 code from higher-level specifications, and a run-time environment providing a portable
    8 SIMD programming abstraction, independent of the specific facilities available
    9 on particular target architectures.
     4This section presents an overview of the SIMD-based parallel bit
     5stream text processing framework, \emph{Parabix}.  The framework has
     6three components: a unifying architectural view of text processing in
     7terms of parallel bit streams, a tool chain for automating the
     8generation of parallel bit stream code from higher-level
     9specifications, and a run-time environment providing a portable SIMD
     10programming abstraction, independent of the specific facilities
     11available on particular target architectures.
    1012
    1113
    1214\subsection{Parallel Bit Streams}
    1315
    14 The fundamental difference between the Parabix framework and traditional text processing models is in how
    15 Parabix represents the source data.   Given a traditional byte-oriented text stream, Parabix
    16 first transposes the text data to a transform domain consisting of 8 parallel bit streams, known
    17 as {\em basis bit streams}.  In essence, each basis bit stream $b_{k}$ represents the stream
    18 of $k$-th bit of each byte in the source text.  That is, the $k$-th bit of $i$-th byte in the source
    19 text is in the $i$-th (bit) position of the $k$-th basis bit stream, $b_{k}$.
    20 For example, in Figure \ref{fig:BitstreamsExample}, we show how the ASCII string ``{\ttfamily b7\verb`<`A}'' is represented as 8 basis bit
    21 streams, $\tt b_{0 \ldots 7}$. The bits used to construct $\tt b_7$ have been highlighted in this example.
     16The fundamental difference between the Parabix framework and
     17traditional text processing models is in how Parabix represents the
     18source data.  Given a traditional byte-oriented text stream, Parabix
     19first transposes the text data to a transform domain consisting of 8
     20parallel bit streams, known as {\em basis bit streams}.  In essence,
     21each basis bit stream $b_{k}$ represents the stream formed from the
     22$k$-th bit of each byte in the source text.  That is, the $k$-th bit
     23of the $i$-th byte in the source text is in the $i$-th (bit) position
     24of the $k$-th basis bit stream, $b_{k}$.  For example, in Figure~\ref{fig:BitstreamsExample}, we show how the ASCII string ``{\ttfamily
     25  b7\verb`<`A}'' is represented as 8 basis bit streams, $\tt b_{0
     26  \ldots 7}$. The bits used to construct $\tt b_7$ have been
     27highlighted in this example.
    2228
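The transposition can be sketched with ordinary Python integers standing in for bit streams (an illustrative model only, not Parabix's implementation): bit $i$ of stream $b_k$ holds the $k$-th bit of byte $i$, with $b_0$ taken as the most significant bit of each byte.

```python
def transpose(text: bytes):
    """Transpose a byte stream into 8 basis bit streams b[0..7].

    Streams are modeled as Python integers: bit i of stream b[k]
    holds the k-th bit of the i-th input byte.  We assume b0 denotes
    the most significant bit of each byte.
    """
    basis = [0] * 8
    for i, byte in enumerate(text):
        for k in range(8):
            if (byte >> (7 - k)) & 1:
                basis[k] |= 1 << i
    return basis

basis = transpose(b"b7<A")
# '7' (0x37) and 'A' (0x41) have their lowest bit set, so stream b7
# has 1s exactly at byte positions 1 and 3.
assert basis[7] == 0b1010
```

A production implementation performs this gather with SIMD pack/merge operations rather than a per-bit loop, which is what yields the amortized cost of about one cycle per byte cited below.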
    2329\begin{figure}[h]
     
    4652
    4753The advantage of the parallel bit stream representation is that we can
    48 use the 128-bit SIMD registers commonly found on commodity processors (e.g. SSE on
    49 Intel) to process 128 byte positions at a time using bitwise logic, shifting and
    50 other operations.
     54use the 128-bit SIMD registers commonly found on commodity processors
     55(e.g. SSE on Intel) to process 128 byte positions at a time using
     56bitwise logic, shifting and other operations.
    5157
    5258Just as forward and inverse Fourier transforms are used to transform
    53 between the time and frequency domains in signal processing, bit stream
    54 transposition and inverse transposition provides ``byte space'' and ``bit space''
    55 views of text.  The goal of the Parabix framework is to support efficient
    56 text processing using these two equivalent representations in the same
    57 way that efficient signal processing benefits from the use of the frequency
    58 domain in some cases and the time domain in others.
     59between the time and frequency domains in signal processing, bit
     60stream transposition and inverse transposition provides ``byte space''
     61and ``bit space'' views of text.  The goal of the Parabix framework is
     62to support efficient text processing using these two equivalent
     63representations in the same way that efficient signal processing
     64benefits from the use of the frequency domain in some cases and the
     65time domain in others.
    5966
    6067In the Parabix framework, basis bit streams are used as the starting
    61 point to determine other bit streams.   In particular, Parabix uses the basis bit streams to
    62 construct \emph{character-class bit streams} in which each $\tt 1$ bit indicates the presense
    63 of a significant character (or class of characters) in the parsing process.
    64 Character-class bit streams may then be used to compute \emph{lexical bit streams} and \emph{error bit streams}, which
    65 Parabix uses to process and validate the source document. The remainder of this section will discuss each type of bit stream.
     68point to determine other bit streams.  In particular, Parabix uses the
     69basis bit streams to construct \emph{character-class bit streams} in
     70which each $\tt 1$ bit indicates the presence of a significant
     71character (or class of characters) in the parsing process.
     72Character-class bit streams may then be used to compute \emph{lexical
     73  bit streams} and \emph{error bit streams}, which Parabix uses to
     74process and validate the source document. The remainder of this
     75section will discuss each type of bit stream.
    6676
    6777{\bf Basis Bit Streams:}
     
    7080so that Parabix can efficiently produce the character-class bit streams.
    7181Using the SIMD capabilities of current commodity processors, the transposition process incurs an amortized cost
    72 of approximately 1 cycle per byte \cite{CameronHerdyLin2008}.
     82of approximately 1 cycle per byte.
    7383%The size of $k$ is dependent on the code unit size of the text encoding format of the source document. A code unit is simply
    7484%a fixed number of bits (or bytes) used to represent a specific character (code point). Some encoding formats, such
     
    8797
    8898{\bf Character-class Bit Streams:}
    89 Typically, as text parsers process input data, they locate specific characters to determine if and when to transition between
    90 data and metadata parsing.
     99Typically, as text parsers process input data, they locate specific
     100characters to determine if and when to transition between data and
     101metadata parsing.
    91102% For example, in a CSV file, any `,' or `\textbackslash n' can indicate the start of a new column or row respectively.
    92 For example, in XML, any opening angle bracket character, `\verb`<`', may indicate that we are starting a new markup tag.
    93 Traditional byte-at-a-time parsers find these characters by comparing the value of each code point with a set
    94 of known code points and branching appropriately when one is found, typically using an if or switch statement.
    95 Using this method to perform multiple transitions in parallel is non-trivial and may require fairly sophisticated algorithms
    96 to do so correctly.
     103For example, in XML, any opening angle bracket character, `\verb`<`',
     104may indicate that we are starting a new markup tag.  Traditional
     105byte-at-a-time parsers find these characters by comparing the value of
     106each code point with a set of known code points and branching
     107appropriately when one is found, typically using an if or switch
     108statement.  Using this method to perform multiple transitions in
     109parallel is non-trivial and may require fairly sophisticated
     110algorithms to do so correctly.
    97111% However, a `<' is legal within an XML comment so not every `<' necessarily means that we are opening a new tag.
    98112
    99 Character-class bit streams allow us to perform up to 128 ``comparisons'' in parallel with a single operation
    100 by using a series of boolean-logic operations
    101 \footnote{$\land$, $\lor$ and $\lnot$ denote the boolean AND, OR and NOT operations.}
    102 to merge multiple basis bit streams into
    103 a single character-class stream that marks the positions of key characters with a $1$.
    104 For example, a character is an `\verb`<`' if and only if $\lnot(b_ 0 \lor b_1) \land (b_2 \land b_3 \land b_4 \land b_5) \land \lnot (b_6 \lor b_7) = 1$.
    105 Classes of characters can be found with similar formulas.
    106 For example, a character is a number {\tt [0-9]} if and only if $\lnot(b_0 \lor b_1) \land (b_2 \land b_3) \land \lnot(b_4 \land (b_5 \lor b_6))$.
    107 An important observation here is that a range of characters can sometimes take fewer operations and require fewer basis bit streams to compute
    108 than individual characters.
    109 Finding optimal solutions to all character-classes is non-trivial and goes beyond the scope of this paper.
     113Character-class bit streams allow us to perform up to 128
     114``comparisons'' in parallel with a single operation by using a series
     115of boolean-logic operations\footnote{$\land$, $\lor$ and $\lnot$
     116  denote the boolean AND, OR and NOT operations.}  to merge multiple
     117basis bit streams into a single character-class stream that marks the
     118positions of key characters with a $1$.  For example, a character is
     119an `\verb`<`' if and only if $\lnot(b_0 \lor b_1) \land (b_2 \land
     120b_3 \land b_4 \land b_5) \land \lnot (b_6 \lor b_7) = 1$.  Classes of
     121characters can be found with similar formulas.  For example, a
     122character is a number {\tt [0-9]} if and only if $\lnot(b_0 \lor b_1)
     123\land (b_2 \land b_3) \land \lnot(b_4 \land (b_5 \lor b_6))$.  An
     124important observation here is that a range of characters can sometimes
     125take fewer operations and require fewer basis bit streams to compute
     126than individual characters.  Finding optimal solutions to all
     127character-classes is non-trivial and goes beyond the scope of this
     128paper.
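Both formulas translate directly into bitwise logic over the basis streams. The following sketch uses our own illustrative helpers (not Parabix's API), modeling each stream as a Python integer whose bit $i$ corresponds to byte position $i$:

```python
def transpose(text: bytes):
    # Gather the k-th bit (b0 = most significant) of each byte into
    # bit position i of stream basis[k].
    basis = [0] * 8
    for i, byte in enumerate(text):
        for k in range(8):
            if (byte >> (7 - k)) & 1:
                basis[k] |= 1 << i
    return basis

def char_class_lt(b, n):
    # '<'  :  ~(b0|b1) & (b2 & b3 & b4 & b5) & ~(b6|b7)
    ones = (1 << n) - 1              # mask to the document length
    return (~(b[0] | b[1]) & (b[2] & b[3] & b[4] & b[5])
            & ~(b[6] | b[7])) & ones

def char_class_digit(b, n):
    # [0-9]:  ~(b0|b1) & (b2 & b3) & ~(b4 & (b5|b6))
    ones = (1 << n) - 1
    return (~(b[0] | b[1]) & (b[2] & b[3])
            & ~(b[4] & (b[5] | b[6]))) & ones

text = b"a<9>"
b = transpose(text)
assert char_class_lt(b, len(text)) == 0b0010      # '<' at position 1
assert char_class_digit(b, len(text)) == 0b0100   # '9' at position 2
```

On real hardware the same boolean expressions run over 128-bit SIMD registers, so one evaluation classifies 128 byte positions at once.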
    110129
    111130% The advantage of character-class bit streams over traditional ``byte streams'' is that
     
    115134
    116135
    117 {\bf Lexical and Error Bit Streams:}
    118 To perform lexical analysis on the input data, Parabix computes lexical and error bit streams from the character-class bit streams using
    119 a mixture of both boolean logic and integer math. Lexical bit streams typically mark multiple current parsing positions.
    120 Unlike the single-cursor approach of traditional text parsers, these allow Parabix to process multiple cursors in parallel.
    121 Error bit streams are often the byproduct or derivative of computing lexical bit streams and can be used to identify any well-formedness
    122 issues found during the parsing process. The presence of a $\tt 1$ bit in an error stream tends to mean that the lexical stream cannot be
    123 trusted to be completely accurate and Parabix may need to perform some sequential parsing on that section to determine the cause and severity
    124 of the error.
    125 
    126 To form lexical bit streams, we have to introduce a few new operations: Advance and ScanThru.
    127 The $\texttt{Advance}$ operator accepts one input parameter, $c$, which is typically viewed as a bit stream containing multiple cursor bits,
    128 and advances each cursor one position forward.  On little-endian architectures, shifting forward means shifting to the right.
    129 $\texttt{ScanThru}$ accepts two input parameters, $c$ and $m$; any bit that is in both $c$ and $m$ is moved to first subsequent
    130 $\tt 0$-bit in $m$ by calculating $(c + m) \land \lnot m$. 
    131 For example, in Figure \ref{fig:ParabixParsingExample} suppose we have the regular expression \verb`<[a-zA-Z]+>` and wish to find all
    132 instances of it in the source text.
    133 We begin by constructing the character classes $C_{0}$, which consists of all letters, $C_{1}$, which contains all `\verb`>`'s, and
    134 $C_{2}$, which marks all `\verb`<`'s. In $L_0$ the position of every `\verb`<`' is advanced by one to locate the first character of each
    135 token. By computing $E_{0}$, the parser notes that ``\verb`<>`'' does not match the expected pattern. To find the end positions of each token,
    136 the parser calculates $L_{1}$ by moving the cursors in $L_0$ through the letter bits in $C_0$. $L_1$ is then validated to ensure that each
    137 token ends with a `\verb`>`' and discovers that ``\verb`<error]`'' too fails to match the expected pattern.
    138 With additional post bit-stream processing, the erroneous cursor positions in $L_{0}$ and $L_{1}$ can be removed or ignored; the details
    139 of which go beyond the scope of this paper.
     136{\bf Lexical and Error Bit Streams:} To perform lexical analysis on
     137the input data, Parabix computes lexical and error bit streams from
     138the character-class bit streams using a mixture of both boolean logic
     139and integer math. Lexical bit streams typically mark multiple current
     140parsing positions.  Unlike the single-cursor approach of traditional
     141text parsers, these allow Parabix to process multiple cursors in
     142parallel.  Error bit streams are derived from the lexical bit streams
     143and can be used to identify any well-formedness issues found during
      144the parsing process. The presence of a $\tt 1$ bit in an error stream
     145tends to mean that the lexical stream cannot be trusted to be
     146completely accurate and Parabix may need to perform some sequential
     147parsing on that section to determine the cause and severity of the
     148error.
     149
     150To form lexical bit streams, we have to introduce a few new
     151operations: Advance and ScanThru.  The $\texttt{Advance}$ operator
     152accepts one input parameter, $c$, which is typically viewed as a bit
     153stream containing multiple cursor bits, and advances each cursor one
     154position forward.  On little-endian architectures, shifting forward
     155means shifting to the right.  $\texttt{ScanThru}$ accepts two input
     156parameters, $c$ and $m$; any bit that is in both $c$ and $m$ is moved
      157to the first subsequent $\tt 0$-bit in $m$ by calculating $(c + m) \land
     158\lnot m$.  For example, in Figure \ref{fig:ParabixParsingExample}
     159suppose we have the regular expression \verb`<[a-zA-Z]+>` and wish to
     160find all instances of it in the source text.  We begin by constructing
     161the character classes $C_{0}$, which consists of all letters, $C_{1}$,
     162which contains all `\verb`>`'s, and $C_{2}$, which marks all
     163`\verb`<`'s. In $L_0$ the position of every `\verb`<`' is advanced by
     164one to locate the first character of each token. By computing $E_{0}$,
     165the parser notes that ``\verb`<>`'' does not match the expected
     166pattern. To find the end positions of each token, the parser
     167calculates $L_{1}$ by moving the cursors in $L_0$ through the letter
     168bits in $C_0$. $L_1$ is then validated to ensure that each token ends
     169with a `\verb`>`' and discovers that ``\verb`<error]`'' too fails to
     170match the expected pattern.  With additional post bit stream
     171processing, the erroneous cursor positions in $L_{0}$ and $L_{1}$ can
      172be removed or ignored; the details go beyond the scope of
     173this paper.
    140174
    141175\begin{figure}[h]
     
    171205% details, refer to the technical report \cite{Cameron2010}.
    172206
    173 Using this parallel bit stream approach, conditional branch statements used to identify key positions and/or syntax errors at each
    174 each parsing position are mostly eliminated, which, as Section \ref{section:XML-branches} shows, minimizes branch misprediction penalties.
    175 Accurate parsing and parallel lexical analysis is done through processor-friendly equations and requires neither speculation nor multithreading.
    176 
    177 
    178 \subsection{Parabix Tool Chain}
     207Using this parallel bit stream approach, conditional branch statements
      208used to identify key positions and/or syntax errors at each
     209parsing position are mostly eliminated, which, as Section
     210\ref{section:XML-branches} shows, minimizes branch misprediction
     211penalties.  Accurate parsing and parallel lexical analysis is done
     212through processor-friendly equations and requires neither speculation
     213nor multithreading.
     214
     215
     216\subsection{Parabix Compilers}
    179217\label{parabix tool chain}
    180218
    181 To support the Parabix framework, we are developing tool technology to automate various aspects
    182 of parallel bit stream programming. At present, our tool chain consists of two compilers: a character class
    183 compiler (ccc) and a unbounded bit stream to C/C++ block-at-a-time processing compiler (Pablo).
    184 
    185 The character class compiler is used to automatically produce bit stream logic for all the individual characters (e.g.,
    186 delimiters) and character classes (e.g., digits, letters) used in a particular application.
    187 Input is specified using a character class syntax adapted from the standard regular expression notations.
    188 Output is a minimized set of three-address bitwise operations, such as $a = b~\&~c$,
    189 to compute each of the character classes from the basis bit streams.
    190 
    191 For example, Figure \ref{fig:CCC} shows the input and output produced by the
    192 character class compiler for the example of \verb`[0-9]`
    193 discussed in the previous section.  The output operations
    194 may be viewed as operations on a single block of input
    195 at a time, or may be viewed as operations on unbounded bitstreams
    196 as supported by the Pablo compiler.
     219To support the Parabix execution framework, we have developed a tool
      220chain to automate various aspects of parallel bit stream
     221programming. Our tool chain consists of two compilers: a
     222character class compiler (\textit{ccc}) and an unbounded bit stream to C/C++
     223block-at-a-time processing compiler (\textit{Pablo}).
     224
     225The character class compiler is used to automatically produce bit
     226stream logic for all the individual characters (e.g., delimiters) and
     227character classes (e.g., digits, letters) used in a particular
     228application.  Input is specified using a character class syntax
     229adapted from the standard regular expression notations.  Output is a
     230minimized set of three-address bitwise operations, such as $a =
     231b~\&~c$, to compute each of the character classes from the basis bit
     232streams.
     233
     234For example, Figure \ref{fig:CCC} shows the input and output produced
     235by the character class compiler for the example of \verb`[0-9]`
     236discussed in the previous section.  The output operations may be
     237viewed as operations on a single block of input at a time, or may be
     238viewed as operations on unbounded bit streams as supported by the Pablo
     239compiler.
    197240
    198241\begin{figure}[tbh]
     
    256299\end{figure}
    257300
    258 The Pablo compiler abstracts away the details of
    259 programming parallel bit stream code in terms of finite
    260 SIMD register widths and application buffer sizes.
    261 Input to Pablo is a language for expressing bitstream operations on unbounded bitstreams. 
    262 The operations include bitwise
    263 logic, the {\tt Advance} and {\tt ScanThru} operations described in he
    264 previous subsection as well as if and while control structures.
    265 Pablo translates these operations to block-at-a-time
    266 code in C/C++.
     301The Pablo compiler abstracts away the details of programming parallel
     302bit stream code in terms of finite SIMD register widths and
     303application buffer sizes.  Input to Pablo is a language for expressing
     304bitstream operations on unbounded bitstreams.  The operations include
     305bitwise logic, the {\tt Advance} and {\tt ScanThru} operations
     306described in the previous subsection as well as ``if-then'' and
     307``while {}'' control structures.  Pablo translates these operations to
     308block-at-a-time code in C/C++.
    267309%, where the block size is the register width for SIMD operations on the selected target architecture.
    268310The key functionality of Pablo is to arrange for block-to-block
     
    271313{\tt Advance} and {\tt ScanThru}.
    272314
    273 For example, we can translate the simple parsing example
    274 of \ref{fig:ParabixParsingExample} above into Pablo code
    275 to produce the output as shown in Figure \ref{fig:Pablo}.
    276 In this example, Pablo has the primary responsibility of inserting
    277 carry variable declarations that allow the results of
    278 Advance and ScanThru operations to be carried over from
    279 block to block.  A separate carry variable is required for every
    280 {\tt Advance} or {\tt ScanThru} operation.  A function containing
    281 such operations is translated into a public C++ class (struct),
    282 which includes a Carry Queue to hold all the carry
    283 variables from iteration to iteration, together with the
    284 a method {\tt do\_block} to implement the processing
    285 for a single block (based on the SIMD register width).
    286 Macros {\tt CarryDeclare} and {\tt CarryInit} declare and
    287 initialize the Carry Queue structure depending on the
    288 specific architecture and Carry Queue representation.
    289 The unbounded bitstream {\tt Advance} and {\tt ScanThru}
    290 operations are translated into block-by-block equivalents
    291 with explicit carry-in and carry-out processing. 
    292 At the end of each block, the {\tt CarryQ\_Adjust}
    293 operation implements any necessary adjustment of the
    294 Carry Queue to prepare for the next iteration.
    295 The Pablo language and compiler also support conditional
    296 and iterative bitstream logic on unbounded streams
    297 (if and while constructs) which involves additional
    298 carry-test insertion in control branches.   Explaining the
    299 full details of the translation
    300 is beyond the scope of this paper, however.
     315For example, we can translate the simple parsing example of
     316\ref{fig:ParabixParsingExample} above into Pablo code to produce the
     317output as shown in Figure \ref{fig:Pablo}.  In this example, Pablo has
     318the primary responsibility of inserting carry variable declarations
     319that allow the results of Advance and ScanThru operations to be
     320carried over from block to block.  A separate carry variable is
     321required for every {\tt Advance} or {\tt ScanThru} operation.  A
     322function containing such operations is translated into a public C++
     323class (struct), which includes a carry queue to hold all the carry
      324variables from iteration to iteration, together with a method {\tt
     325  do\_block} to implement the processing for a single block (based on
     326the SIMD register width).  Macros {\tt CarryDeclare} and {\tt
     327  CarryInit} declare and initialize the carry queue structure
     328depending on the specific architecture and carry queue representation.
     329The unbounded bitstream {\tt Advance} and {\tt ScanThru} operations
     330are translated into block-by-block equivalents with explicit carry-in
     331and carry-out processing.  At the end of each block, the {\tt
     332  CarryQ\_Adjust} operation implements any necessary adjustment of the
     333carry queue to prepare for the next iteration.  The Pablo language and
     334compiler also support conditional and iterative bitstream logic on
     335unbounded streams (if and while constructs) which involves additional
     336carry-test insertion in control branches.  Explaining the full details
     337of the translation is beyond the scope of this paper, however.
    301338
    302339\subsection{Parabix Run-Time Libraries}
    303340
    304 The Parabix architecture also includes run-time libraries
    305 that support a machine-independent view of basic SIMD
    306 operations, as well as a set of core function libraries.
    307 For machine-independence, we program all operations using
    308 an abstract SIMD machine based on the Inductive Doubling
    309 Instruction Set Architecture (IDISA) \cite{CameronLin2009}.
    310 The abstract machine supports all power-of-2 field widths up to the full
    311 SIMD register width on a target machine.   
    312 Let $w = 2k$ be the field width in bits. Let $f$ be a basic binary operation defined on $w$-bit quantities
    313 producing an $w$-bit result. Let $W$ be the SIMD vector size in bits where $W = 2K$.
    314 Then the C++ template notation \verb`v=simd<w>::f(a,b)` denotes the general pattern for a vertical SIMD operation yielding an output SIMD vector $v$,
    315 given two input SIMD vectors $a$ and $b$. For each field $v_i$ of $v$, the value computed is $f(a_i, b_i)$.
    316 For example, given 128-bit SIMD vectors, \verb`simd<8>::add(a,b)` represents the simultaneous addition of sixteen 8-bit fields.
    317 
    318 These operations were originally developed for 128-bit Altivec operations on Power PC
    319 as well as 64-bit MMX and 128-bit SSE operations on Intel
    320 but have recently extended to support
    321 the new 256-bit AVX operations on Intel as well as the 128-bit
    322 NEON operations on the ARM architecture.
     341The Parabix architecture also includes run-time libraries that support
     342a machine-independent view of basic SIMD operations, as well as a set
     343of core function libraries.  For machine-independence, we program all
     344operations using an abstract SIMD machine.  The abstract machine
     345supports all power-of-2 field widths up to the full SIMD register
     346width on a target machine.  Let $w = 2k$ be the field width in
     347bits. Let $f$ be a basic binary operation defined on $w$-bit
     348quantities producing an $w$-bit result. Let $W$ be the SIMD vector
     349size in bits where $W = 2K$.  Then the C++ template notation
     350\verb`v=simd<w>::f(a,b)` denotes the general pattern for a vertical
     351SIMD operation yielding an output SIMD vector $v$, given two input
     352SIMD vectors $a$ and $b$. For each field $v_i$ of $v$, the value
     353computed is $f(a_i, b_i)$.  For example, given 128-bit SIMD vectors,
     354\verb`simd<8>::add(a,b)` represents the simultaneous addition of
     355sixteen 8-bit fields.
     356
     357These operations were originally developed for 128-bit Altivec
     358operations on Power PC as well as 64-bit MMX and 128-bit SSE
      359operations on Intel but have recently been extended to support the new
     360256-bit AVX operations on Intel as well as the 128-bit NEON operations
     361on the ARM architecture.
    323362 
  • docs/HPCA2012/03b-research.tex

    r1388 r1393  
    1010\end{figure}
    1111
    12 This section describes the implementation of the Parabix XML parser
    13 using the framework explained in Section \ref{section:parabix}.
     12This section describes the implementation of the Parabix XML parser.
    1413Figure \ref{parabix_arch} shows its overall structure set up for
    1514well-formedness checking.  The input file is
     
    3534consist entirely of parallel bit stream operations.  Of these, the
    3635Classification function consists of XML character class definitions
    37 that are generated using ccc, while much of the U8\_Validation
     36that are generated using our character class compiler \textit{ccc}, while much of the U8\_Validation
    3837similarly consists of UTF-8 byte class definitions that are also
    3938generated by ccc.  The remainder of these functions are programmed
  • docs/HPCA2012/04-methodology.tex

    r1378 r1393  
    44\paragraph{XML Parsers}\label{parsers}
    55
    6 We evaluate the Parabix XML parser described above
    7 against two widely available open-source
    8 parsers,  Xerces-C++, and Expat.
    9 Each of the parsers is evaluated on the task of implementing the
    10 parsing and well-formedness checking requirements of the full
    11 XML 1.0 specification\cite{TR:XML}.    Xerces-C++
    12 version 3.1.1 (SAX) \cite{xerces} is a validating open source XML
    13 parser written in C++ available as part of the the Apache project.
    14 However, we use the WFXML scanner of Xerces to avoid the costs
    15 of validation and also use the SAX interface to avoid the
    16 costs of DOM tree construction.
    17 Expat version 2.0.1 \cite{expat} is a non-validating XML parser
    18 library written in C.
     6We evaluate the Parabix XML parser described above against two widely
     7available open-source parsers, Xerces-C++, and Expat.  Each of the
     8parsers is evaluated on the task of implementing the parsing and
     9well-formedness checking requirements of the full XML 1.0
     10specification\cite{TR:XML}.  Xerces-C++ version 3.1.1 (SAX)
     11\cite{xerces} is a validating open source XML parser written in C++
      12available as part of the Apache project.  To ensure a fair
     13comparison, we use the WFXML scanner of Xerces to eliminate the
     14overheads of validation and also use the SAX interface to avoid the
      15overhead of DOM tree construction.  Expat version 2.0.1
     16\cite{expat} is a non-validating XML parser library written in C.
    1917
    2018
     
    2826traditional recursive descent XML parser implementations.  We use a
    2927mixture of document-oriented and data-oriented XML files in our study
    30 to provide workloads with a full spectrum of markup densities.
      28to analyze workloads with a full spectrum of markup densities.
    3129
    3230Table \ref{XMLDocChars} shows the document characteristics of the XML
     
    3634UTF-8 encoding of Japanese and German characters respectively.  The
    3735remaining data files are data-oriented XML documents and consist
    38 entirely of single byte $7$-bit encoded ASCII characters.
      36entirely of single-byte encoded ASCII characters.
    3937
    4038\begin{table*}
  • docs/HPCA2012/06-scalability.tex

    r1389 r1393  
    138138\label{relative_performance_intel}
    139139}
    140 \caption{Parabix performance on mobile platforms}
      140\caption{Comparing Parabix on ARM and Intel.}
    141141\end{figure}
    142142
  • docs/HPCA2012/main.tex

    r1380 r1393  
    160160%{$\{$ashriram,sandhya,kshen$\}$@cs.rochester.edu}
    161161%}
    162 \date{}
     162\def\iccvPaperID{160}
    163163\begin{document}
    164164\maketitle