# source:docs/HPCA2012/02-background.tex@1781

Last change on this file since 1781 was 1693, checked in by lindanl, 8 years ago

\section{Background}
\label{section:background}

% This section provides

% Extensible Markup Language (XML) is a core technology standard
% of the World Wide Web Consortium (W3C) that provides a common
% framework for encoding and communicating structured information.
% In applications ranging from Office Open XML in
% Microsoft Office to NDFD XML of the NOAA National Weather
% Service, from KML in Google Earth to Castor XML in the Martian Rovers,
% from ebXML for e-commerce data interchange to RSS for news feeds
% from web sites everywhere, XML plays a ubiquitous role in providing
% a common framework for data interoperability world-wide and beyond.
% As XML 1.0 editor Tim Bray is quoted in the W3C celebration of XML at 10 years,
% "there is essentially no computer in the world, desk-top, hand-held,
% or back-room, that doesn't process XML sometimes."



%
% With this focus on performance, however, relatively little attention
% has been paid to reducing energy consumption in XML processing.  For example, in addressing
% performance through multicore parallelism, one generally must
% pay an energy price for performance gains because of the
% increased processing required for synchronization.
% This focus on reduction of energy consumption is a key topic in this
% paper. We study the energy and performance
% characteristics of several XML parsers across three generations
% of x86-64 processor technology.  The parsers we consider are
% the widely used byte-at-a-time parsers Expat and Xerces, as well as the
% Parabix1 and Parabix2 parsers based on parallel bit stream technology.
% A compelling result is that the performance benefits of parallel bit stream technology
% translate directly and proportionally to substantial energy savings.


\subsection{XML}
% In 1998, the W3C officially adopted XML as a standard for data interchange.
% Today, XML can be found everywhere from mobile websites and cloud computing to server-side database engines.

Extensible Markup Language (XML) is a core technology standard of the World Wide Web Consortium (W3C);
it provides a common framework for encoding and communicating structured and semi-structured information.
XML can represent virtually any type of information (i.e., content) in a descriptive fashion.
XML markup encodes a description of an XML document's storage layout and logical structure.
Since XML is intended to be human-readable, markup tags are often verbose by design \cite{TR:XML}.
For example, Figure \ref{fig:sample_xml} shows a product list encoded as an XML document.
All content is highlighted in bold; anything that is not content is markup.

\begin{figure}[h]

{\footnotesize
\begin{center}
\begin{tabbing}
\hspace*{1em} \= \hspace*{1em} \= \hspace*{1em} \= \hspace*{1em} \= \hspace*{1em} \= \\
{\ttfamily{\verb|<Products>|}}\+ \\
{\ttfamily{\verb|<Product ID="|}}{\bfseries\ttfamily{0001}}{\ttfamily{\verb|">|}}\+ \\
{\ttfamily{\verb|<ProductName Lang="|}}{\bfseries\ttfamily{English}}{\ttfamily{\verb|">|}}{\bfseries\ttfamily{Widget}}{\ttfamily{\verb|</ProductName>|}}\\
{\ttfamily{\verb|<ProductName Lang="|}}{\bfseries\ttfamily{French}}{\ttfamily{\verb|">|}}{\bfseries\ttfamily{Bitoniau}}{\ttfamily{\verb|</ProductName>|}}\\
{\ttfamily{\verb|<Company>|}}{\bfseries\ttfamily{ABC}}{\ttfamily{\verb|</Company>|}}\\
{\ttfamily{\verb|<Price>|}}{\bfseries\ttfamily{\$19.95}}{\ttfamily{\verb|</Price>|}}\- \\
{\ttfamily{\verb|</Product>|}}\- \\
{\ttfamily{\verb|</Products>|}}\\
\end{tabbing}
\end{center}}

\caption{Sample XML Document}\label{fig:sample_xml}
\end{figure}



\subsection{XML Parsers}
% However, textual data tends to consist of variable-length items in generally unpredictable patterns \cite{Cameron2010}.
Traditional XML parsers process an XML document sequentially, one
byte at a time, from the first character to the last in the source
text. Each character is examined to distinguish between
XML-specific markup, such as a left angle bracket (\verb|<|), and the
content held within the document. A logical {\it cursor} position denotes
the current character under interpretation. As the parser moves its cursor
through the document, it alternates between markup scanning, validation,
and content processing operations. In other words, traditional XML parsers
operate as finite-state machines that use byte comparisons to transition
between data and metadata states. Each state transition indicates the
context for subsequent characters. Unfortunately, textual data tends to
consist of variable-length items sequenced in generally unpredictable
patterns; thus any character could signal a state transition until deemed
otherwise.

A major disadvantage of the sequential byte-at-a-time approach to XML
parsing is that processing an XML document requires at least one
conditional branch per byte of source text. For example, Xerces-C,
which forms the foundation of the widely deployed Apache XML project
\cite{xerces}, uses a series of nested switch statements and
state-dependent flag tests to control the parsing logic of the
program.
Xerces's complex data-dependent control flow requires between
6 and 13 branches per byte of XML input, depending on the markup in
the file (details in Section~\ref{section:XML-branches}). Cache
utilization is also significantly reduced due to the manner in which
markup and content must be scanned and buffered for future use. For
instance, Xerces incurs $\sim$100 L1 cache misses per kilobyte (kB) of
XML data. In general, while microarchitectural improvements may help the
parser overcome some of these challenges (e.g., cache misses), the
fundamental data and control flow in these parsers is ill suited to
commodity processors and incurs significant overhead.



% Switch block logical
% Branches
% Cache utilization

% \subsection{Parallel XML Parsing}
%
%
% In general, parallel acceleration methods come in two forms: multithreaded approaches and
% SIMD-based techniques. Multithreaded XML parsers take advantage of multiple cores via a number of strategies.
% Common strategies include preparsing the XML file to locate key partitioning points \cite{ZhangPanChiu09}
% and speculative p-DFAs \cite{ZhangPanChiu09}.

% To accelerate XML parsing performance, most of the recent
% work has focused on parallelization through the use of multicore
% parallelism for chip multiprocessors \cite{ZhangPanChiu09, ParaDOM2009, LiWangLiuLi2009}.


% SIMD XML parsers leverage the SIMD registers to overcome the
% performance limitations of the sequential byte-at-a-time processing model and its inherently data-dependent
% branch misprediction rates. Further, data-parallel SIMD instructions allow the processor to perform the same
% operation on multiple pieces of data simultaneously. The Parabix1 and Parabix2 parsers studied in this paper
% fall under the SIMD classification. The Parabix parser versions studied are described in further detail in
% Section \ref{section:parabix}.
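The finite-state, byte-at-a-time scanning loop described above can be sketched in a few lines of C. This is an illustrative fragment only, not code from Expat or Xerces, and the state set is reduced to two states (the function name is hypothetical); it shows why at least one conditional branch per input byte is unavoidable in this model:

```c
#include <assert.h>

/* Minimal byte-at-a-time scanner: one conditional branch per input byte
 * decides whether the cursor is inside markup or content.  Real parsers
 * such as Expat and Xerces track many more states (attributes, CDATA,
 * comments, entity references), but the per-byte branching is the same. */
enum state { CONTENT, MARKUP };

/* Count how many bytes of the document are content (text between tags). */
static int count_content_bytes(const char *doc)
{
    enum state s = CONTENT;
    int content = 0;
    for (const char *p = doc; *p; p++) {   /* one pass, one byte at a time */
        if (s == CONTENT) {
            if (*p == '<') s = MARKUP;     /* state transition on '<' */
            else content++;
        } else {
            if (*p == '>') s = CONTENT;    /* state transition on '>' */
        }
    }
    return content;
}
```

Multiplying the states to handle full XML syntax is what pushes a real parser such as Xerces to several data-dependent branches per byte.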
% while SIMD (Single Instruction Multiple Data) parallelism
% has been of interest to Intel in designing new SIMD instructions \cite{XMLSSE42},
% as well as to the developers of parallel bit stream technology
% \cite{CameronHerdyLin2008, Cameron2009, Cameron2010}.
% Each of these approaches has shown considerable performance
% benefits over traditional sequential parsing techniques that follow the
% byte-at-a-time model.

%\subsection{SIMD Operations}
% Two such SIMD XML parsers, Parabix1 and Parabix2, utilize parallel bit stream processing technology.
% Extract section 2.2 and merge into 3. Add a new subsection
% in section 2 saying a bit about SIMD. Say a bit about pure SIMD vertical
% operations and then mention the pack operations that allow
% us to implement transposition efficiently in parallel.
% Also note that the SIMD registers support bitwise logic across
% their full width and that this is extensively used in our work.
% \subsection{Parallel XML Parsing}
%
% Parallel XML processing generally comes in one of two forms: multithreading and SIMD.
% Multithreaded XML parsers take advantage of parallelism by first quickly preparsing the
% XML file to locate the key markup entities and determine the best workload distribution
% with which to process the XML file using $n$ cores \cite{ZhangPanChiu09}. SIMD XML parsers
% leverage the SIMD registers to overcome the performance limitations of the sequential
% paradigm and its inherently data-dependent branch misprediction rates \cite{Cameron2010}.
% Two such SIMD XML parsers, Parabix1 and Parabix2, utilize parallel bit stream processing
% technology. With this method, byte-oriented character data is first transposed to eight
% parallel bit streams, one for each bit position within the character code units (bytes).
% These bit streams are then loaded into SIMD registers of width $W$ (e.g., 64-bit, 128-bit,
% 256-bit, etc.). This allows $W$ consecutive code units to be represented and processed at
% once. Bitwise logic and shift operations, bit scans, population counts and other bit-based
% operations are then used to carry out the work in parallel \cite{CameronLin2009}.
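The transposition step described above can be made concrete with a small $W=64$ sketch in C. This is illustrative only, with hypothetical function names; the actual Parabix implementation uses SIMD registers and a faster pack-based transposition:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Transpose up to 64 bytes of text into eight parallel bit streams:
 * bit i of basis[k] holds bit k of byte i.  A naive W=64 sketch;
 * Parabix performs this with SIMD pack operations. */
static void transpose(const char *text, size_t n, uint64_t basis[8])
{
    memset(basis, 0, 8 * sizeof(uint64_t));
    for (size_t i = 0; i < n && i < 64; i++)
        for (int k = 0; k < 8; k++)
            if ((unsigned char)text[i] >> k & 1)
                basis[k] |= (uint64_t)1 << i;
}

/* Classify every '<' in the block at once with pure bitwise logic.
 * '<' is 0x3C = 0b00111100: bits 2..5 set, bits 0, 1, 6, 7 clear. */
static uint64_t lt_marks(const uint64_t b[8])
{
    return b[2] & b[3] & b[4] & b[5] & ~(b[0] | b[1] | b[6] | b[7]);
}
```

After transposition, a single bitwise expression locates all occurrences of a character class in the block simultaneously, with no per-byte branching.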
%
% \subsubsection{Parabix1}
%
% Our first-generation parallel bit stream XML parser---Parabix1---employs a less conventional
% application of SIMD technology to represent text in parallel bit streams. Bits of each stream
% are in one-to-one correspondence with the bytes of a character stream. A transposition step
% first transforms sequential byte stream data into eight basis bit streams, one for each bit
% position within the bytes. Bitwise logical combinations of these basis bit streams can then
% be used to classify bytes in various ways, while the bit scan operations common to commodity
% processors can be used for fast sequential scanning. At a high level, Parabix1 processes
% source XML in a functionally equivalent manner to a traditional parser. That is, Parabix1
% moves sequentially through the source document, maintaining a single cursor scanning position
% throughout the parse. However, the scanning operation itself is accelerated significantly,
% which leads to dramatic performance improvements, since bit scan operations can perform up
% to general-register-width (32-bit, 64-bit) finite state transitions per clock cycle. This
% approach has recently been applied to Unicode transcoding and XML parsing to good effect,
% with research prototypes showing substantial speed-ups over even the best byte-at-a-time
% alternatives \cite{CameronHerdyLin2008, CameronLin2009, Cameron2010}.
%
% \subsubsection{Parabix2}
%
% In our second-generation XML parser---Parabix2---we replace sequential parsing with bit scan
% instructions by a parallel parsing method using bitstream addition. Unlike the single-cursor
% approach of Parabix1 (and, conceptually, of the traditional sequential approach), Parabix2
% processes multiple cursor positions in parallel. To deal with these parallel cursors, three
% additional categories of bit streams are introduced. Marker bit streams represent positions
% of interest in the parsing of a source data stream \cite{Cameron2010}. The appearance of a 1
% at a position in a marker bitstream could, for example, denote the starting position of an
% XML tag in the data stream. In general, the set of bit positions in a marker bitstream may
% be considered the current parsing positions of multiple parses taking place in parallel
% throughout the source data stream. A further aspect of the parallel method is that the
% conditional branch statements used to identify syntax errors at each parsing position are
% eliminated. Instead, error bit streams identify the positions of parsing or well-formedness
% errors during the parsing process. Error positions are gathered and processed in a final
% post-processing step. Hence, Parabix2 offers additional parallelism over Parabix1 in the
% form of multiple-cursor parsing, and also significantly reduces branch misprediction penalty.
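The multiple-cursor, bitstream-addition idea sketched above can be illustrated with the classic scan-through operation. This is a minimal sketch under the convention that bit $i$ of a stream corresponds to text position $i$; the name {\tt scan\_thru} is illustrative, not the Parabix2 API:

```c
#include <assert.h>
#include <stdint.h>

/* Advance every cursor in the marker stream M past its run of 1-bits in
 * the character-class stream C.  A single addition moves all cursors in
 * parallel: the carry generated at each marker ripples through its run,
 * and masking with ~C keeps only the position just past each run. */
static uint64_t scan_thru(uint64_t M, uint64_t C)
{
    return (M + C) & ~C;
}
```

For the text {\tt <ab> <cd>} with markers on the two name-start positions (bits 1 and 6) and a class stream marking name characters (bits 1, 2, 6, 7), one call advances both cursors to their closing {\tt >} (bits 3 and 8), with no branches and no misprediction penalty.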

%
% --- Not Sure ---
%
% The first two parsers employ traditional byte-at-a-time methods of
% parsing; these parsers were selected based on their popularity in the
% marketplace and the availability of source code for deeper analysis.
% The Fluke i410 current clamp is a digital multimeter that reads the
% magnetic field of a live electrical cable to determine the current
% passing through it without affecting the underlying hardware.

% \textbf{No need to talk about resource usage until later}

% In order to determine how and which performance factors influence energy consumption,
% we intend to use the Fluke i410 current clamp in conjunction with PMCs to compare the per parser invocation and per source XML byte energy usage of three XML parsers:
% Expat 2.0.1, Xerces-C++ 3.1.1 (SAX2), and Parabix2. All three parsers are C/C++ based, event-driven, stream-oriented XML parsers.

%
%
% Need gory details on byte-at-a-time parsers. Pictures.
% Xerces. Explain overall data flow and control flow for these parsers.
% Briefly highlight inefficiencies.
%
%
%
% Talk about the usage of SIMD instructions and how it might help. Lead
% on to briefly describe the key technology behind Parabix.

%
187%