source: docs/HPCA2012/02-background.tex @ 1788

Last change on this file since 1788 was 1693, checked in by lindanl, 8 years ago

Minor changes.

File size: 13.4 KB
4% This section provides
6% Extensible Markup Language (XML) is a core technology standard
7% of the World Wide Web Consortium (W3C) that provides a common
8% framework for encoding and communicating structured information. 
9% In applications ranging from Office Open XML in
10% Microsoft Office to NDFD XML of the NOAA National Weather
11% Service, from KML in Google Earth to Castor XML in the Martian Rovers,
12% from ebXML for e-commerce data interchange to RSS for news feeds
13% from web sites everywhere, XML plays a ubiquitous role in providing
14% a common framework for data interoperability world-wide and beyond.
15% As XML 1.0 editor Tim Bray is quoted in the W3C celebration of XML at 10 years,
16% "there is essentially no computer in the world, desk-top, hand-held,
17% or back-room, that doesn't process XML sometimes."
22% With this focus on performance however, relatively little attention
23% has been paid on reducing energy consumption in XML processing.  For example, in addressing
24% performance through multicore parallelism, one generally must
25% pay an energy price for performance gains because of the
26% increased processing required for synchronization.   
27% This focus on reduction of energy consumption is a key topic in this
28% paper. We study the energy and performance
29% characteristics of several XML parsers across three generations
30% of x86-64 processor technology.  The parsers we consider are
31% the widely used byte-at-a-time parsers Expat and Xerces, as well the
32% Parabix1 and Parabix2 parsers based on parallel bit stream technology. 
33% A compelling result is that the performance benefits of parallel bit stream technology
34% translate directly and proportionally to substantial energy savings.
38% In 1998, the W3C officially adopted XML as a standard for data interchange.
39% Today, XML can be found everywhere from mobile websites and cloud computing to server-side database engines.
41Extensible Markup Language (XML) is a core technology standard of the World Wide Web Consortium (W3C);
42it provides a common framework for encoding and communicating structured and semi-structured information. 
43XML can represent virtually any type of information (i.e., content) in a descriptive fashion.
44XML markup encodes a description of an XML document's storage layout and logical structure.
45Since XML is intended to be human-readable, markup tags are often verbose by design \cite{TR:XML}.
46For example, Figure \ref{fig:sample_xml} provides a standard product list encapsulated within an XML document.
47All content is highlighted in bold. Anything that is not content is considered markup.
54\hspace*{1em} \= \hspace*{1em} \= \hspace*{1em} \= \hspace*{1em} \= \hspace*{1em} \= \\
55{\ttfamily{\verb`<`Products\verb`>`}}\+ \\
56{\ttfamily{\verb`<`Product ID=\verb`"`}}{\bfseries\ttfamily{0001}}{\ttfamily{\verb`"`\verb`>`}}\+ \\
57{\ttfamily{\verb`<`ProductName Lang=\verb`"`}}{\bfseries\ttfamily{English}}{\ttfamily{\verb`"`\verb`>`}}{\bfseries\ttfamily{Widget}}{\ttfamily{\verb`<`/ProductName\verb`>` }}\\
58{\ttfamily{\verb`<`ProductName Lang=\verb`"`}}{\bfseries\ttfamily{French}}{\ttfamily{\verb`"`\verb`>`}}{\bfseries\ttfamily{Bitoniau}}{\ttfamily{\verb`<`/ProductName\verb`>` }}\\
59{\ttfamily{\verb`<`Company\verb`>`}}{\bfseries\ttfamily{ABC}}{\ttfamily{\verb`<`/Company\verb`>` }}\\
60{\ttfamily{\verb`<`Price\verb`>`}}{\bfseries\ttfamily{\$19.95}}{\ttfamily{\verb`<`/Price\verb`>` }}\- \\
61{\ttfamily{\verb`<`/Product\verb`>` }}\- \\
62{\ttfamily{\verb`<`/Products\verb`>` }}\\
66\caption{Sample XML Document}\label{fig:sample_xml}
71\subsection{XML Parsers}
72% However, textual data tends to consist of variable-length items in generally unpredictable patterns \cite{Cameron2010}.
73Traditional XML parsers process an XML document sequentially, a single
74byte-at-a-time, from the first to the last character in the source
75text.  Each character is examined to distinguish between the
76XML-specific markup, such as a left angle bracket `\verb`<`', and the
77content held within the document.  A logical {\it cursor} position denotes the current character under interpretation. As the parser moves its cursor through the document, it alternates
78between markup scanning, validation, and content processing
79operations.  In other words, traditional XML parsers operate as
80finite-state machines that use byte comparisons to transition between
81data and metadata states. Each state transition indicates the context
82for subsequent characters. Unfortunately, textual data tends to
83consist of variable-length items sequenced in generally unpredictable
84patterns; thus any character could be a state transition until deemed
87A major disadvantage of the sequential byte-at-a-time approach to XML
88parsing is that processing an XML document requires at least one
89conditional branch per byte of source text.  For example, Xerces-C,
90which forms the foundation for widely deployed the Apache XML project
91\cite{xerces}, uses a series of nested switch statements and
92state-dependent flag tests to control the parsing logic of the
93program. Xerces's complex data dependent control flow requires between
946 -- 13 branches per byte of XML input, depending on the markup in
95the file (details in Section~\ref{section:XML-branches}).  Cache
96utilization is also significantly reduced due to the manner in which
97markup and content must be scanned and buffered for future use.  For
98instance, Xerces incurs $\sim$100 L1 cache misses per kilobyte (kB) of
99XML data. In general, while microarchitectural improvements may help the
100parser tide over some of these challenges (e.g., cache misses), the
101fundamental data and control flow in the parsers are ill suited for
102commodity processors and experience significant overhead.
107% Switch block logical
108% Branches
109% Cache utilization
111% \subsection {Parallel XML Parsing}
114% In general, parallel acceleration methods come in two forms: multithreaded approaches and
115% SIMD-based techniques. Multithreaded XML parsers take advantage of multiple cores via number of strategies.
116% Common strategies include preparsing the XML file to locate key partitioning points \cite{ZhangPanChiu09}
117% and speculative p-DFAs \cite{ZhangPanChiu09}.
119% To accelerate XML parsing performance, most of the recent
120% work has focused on parallelization through the use of multicore
121% parallelism for chip multiprocessors \cite{ZhangPanChiu09, ParaDOM2009, LiWangLiuLi2009},
124% SIMD XML parsers leverage the SIMD registers to overcome the
125% performance limitations of the sequential byte-at-a-time processing model and its inherently data dependent
126% branch misprediction rates.  Further, data parallel SIMD instructions allow the processor to perform the same
127% operation on multiple pieces of data simultaneously.  The Parabix1 and Parabix2 parsers studied in this paper
128% fall under the SIMD classification. The Parabix parser versions studied are described in further detail in
129% Section \ref{section:parabix}.
131% while SIMD (Single Instruction Multiple Data) parallelism
132% has been of interest to Intel in designing new SIMD instructions\cite{XMLSSE42}
133% , as well as to the developers of parallel bit stream technology
134% \cite{CameronHerdyLin2008, Cameron2009, Cameron2010}.
135% Each of these approaches has shown considerable performance
136% benefits over traditional sequential parsing techniques that follow the
137% byte-at-a-time model.
139%\subsection {SIMD Operations}
140% Two such SIMD XML parsers, Parabix1 and Parabix2, utilizes parallel bit stream processing technology.
141% Extract section 2.2 and merge into 3.   Add a new subsection
142% in section 2 saying a bit about SIMD.   Say a bit about pure SIMD vertical
143% operations and then mention the pack operations that allow
144% us to implement transposition efficiently in parallel. 
145% Also note that the SIMD registers support bitwise logic across
146% their full width and that this is extensively used in our work.
147% \subsection{Parallel XML Parsing}
149% Parallel XML processing generally comes in one of two forms: multithreading and SIMD. Multithreaded XML parsers take advantage of parallelism by first quickly preparsing the XML file to locate the key markup entities and determine the best workload distribution in which process the XML file using $n$-cores \cite{ZhangPanChiu09}. SIMD XML parsers leverage the SIMD registers to overcome the performance limitations of the sequential paradigm and inherently data dependent branch misprediction rates \cite{Cameron2010}. Two such SIMD XML parsers, Parabix1 and Parabix2, utilizes parallel bit stream processing technology. With this method, byte-oriented character data is first transposed to eight parallel bit streams, one for each bit position within the character code units (bytes). These bit streams are then loaded into SIMD registers of width $W$ (e.g., 64-bit, 128-bit, 256-bit, etc). This allows $W$ consecutive code units to be represented and processed at once. Bitwise logic and shift operations, bit scans, population counts and other bit-based operations are then used to carry out the work in parallel \cite{CameronLin2009}.
151% \subsubsection{Parabix1}
153% Our first generation parallel bit stream XML parser---Parabix1---uses employs a less conventional approach of SIMD technology to represent text in parallel bit streams. Bits of each stream are in one-to-one-correspondence with the bytes of a character stream. A transposition step first transforms sequential byte stream data into eight basis bit streams for the bits of each byte. Bitwise logical combinations of these basis bit streams can then be used to classify bytes in various ways, while the bit scan operations common to commodity processors can be used for fast sequential scanning. At a high level, Parabix1 processes source XML in a functionally equivalent manner as a traditional processor. That is, Parabix1 moves sequentially through the source document, maintaining a single cursor scanning position throughout the parse. However, this scanning operation itself is accelerated significantly which leads to dramatic performance improvements, since bit scan operations can perform up to general register width (32-bit, 64-bit) finite state transitions per clock cycle. This approach has recently been applied to Unicode transcoding and XML parsing to good effect, with research prototypes showing substantial speed-ups over even the best of byte-at-a-time alternatives \cite{CameronHerdyLin2008, CameronLin2009, Cameron2010}.
155% \subsubsection{Parabix2}
157% In our second generation XML parser---Parabix2---we address the replacement of sequential parsing using bit scan instructions with a parallel parsing method using bitstream addition. Unlike the single cursor approach of Parabix1 and conceptually of traditional sequential approach, in Parabix2 multiple cursors positions are processed in parallel. To deal with these parallel cursors, three additional categories of bit streams are introduced. Marker bit streams are used to represent positions of interest in the parsing of a source data stream \cite{Cameron2010}. The appearance of a 1 at a position in a marker bitstream could, for example, denote the starting position an XML tag in the data stream. In general, the set of bit positions in a marker bitstream may be considered to be the current parsing positions of multiple parses taking place in parallel throughout the source data stream. A further aspect of the parallel method is that conditional branch statements used to identify syntax error at each each parsing position are eliminated. Instead, error bit streams are used to identify the position of parsing or well-formedness errors during the parsing process. Error positions are gathered and processed in as a final post processing step. Hence, Parabix2 offers additional parallelism over Parabix1 in the form of multiple cursor parsing as well as significanlty reduces branch misprediction penalty.
160% --- Not Sure ---
162% The first two parsers employ traditional byte-at-a-time methods of
163% parsing; these parsers were selected based on their popularity in the
164% marketplace and the availability of source code for deeper analysis.
165% The Fluke i410 current clamp is a digital multimeter that reads the
166% magnetic field of a live electrical cable to determine the current
167% passing through it without affecting the underlying hardware.
169% \textbf{No Need to talk about resource usage until  later}
171% In order to determine how and which performance factors influence energy consumption,
172% we intend to use the Fluke i410 current clamp in conjunction with PMCs to compare the per parser invocation and per source XML byte energy % usage of three XML parsers:
173% Expat 2.0.1, Xerces-C++ 3.1.1 (SAX2), and Parabix2. All three parsers are C/C++ based, event-driven, stream-oriented XML parsers.
175% Think the person reading doesn't know much about XML parsers.
178% Need gory details on byte at a time parsers. Pictures.
179% Xerces. Explain overall data flow and control flow for these parsers.
180% Briefly highlight inefficiencies.
184% Talk about the usage of SIMD instructions and how it might help. Lead
185% onto briefly describe the key technology behind parabix.
188% --- Section Advice ---
190% 1. Think the person reading doesn't know much about XML parsers.
192% 2. Need gory details on byte at a time parsers. Pictures.
193% Xerces. Explain overall dataflow and control flow for these parsers.
194% Briefly highlight inefficiencies.
196% 3. Talk about the usage of SIMD instructions and how it might help. Lead
197% onto briefly describe the key technology behind parabix.
Note: See TracBrowser for help on using the repository browser.