source: docs/HPCA2012/final_ieee/02-background.tex

Last change on this file was 1783, checked in by ashriram, 8 years ago

Final pass

File size: 13.4 KB
4% This section provides
6% Extensible Markup Language (XML) is a core technology standard
7% of the World Wide Web Consortium (W3C) that provides a common
8% framework for encoding and communicating structured information. 
9% In applications ranging from Office Open XML in
10% Microsoft Office to NDFD XML of the NOAA National Weather
11% Service, from KML in Google Earth to Castor XML in the Martian Rovers,
12% from ebXML for e-commerce data interchange to RSS for news feeds
13% from web sites everywhere, XML plays a ubiquitous role in providing
14% a common framework for data interoperability world-wide and beyond.
15% As XML 1.0 editor Tim Bray is quoted in the W3C celebration of XML at 10 years,
16% "there is essentially no computer in the world, desk-top, hand-held,
17% or back-room, that doesn't process XML sometimes."
22% With this focus on performance however, relatively little attention
23% has been paid on reducing energy consumption in XML processing.  For example, in addressing
24% performance through multicore parallelism, one generally must
25% pay an energy price for performance gains because of the
26% increased processing required for synchronization.   
27% This focus on reduction of energy consumption is a key topic in this
28% paper. We study the energy and performance
29% characteristics of several XML parsers across three generations
30% of x86-64 processor technology.  The parsers we consider are
31% the widely used byte-at-a-time parsers Expat and Xerces, as well the
32% Parabix1 and Parabix2 parsers based on parallel bit stream technology. 
33% A compelling result is that the performance benefits of parallel bit stream technology
34% translate directly and proportionally to substantial energy savings.
37% In 1998, the W3C officially adopted XML as a standard for data interchange.
38% Today, XML can be found everywhere from mobile websites and cloud computing to server-side database engines.
40Extensible Markup Language (XML) is a core technology standard of the World Wide Web Consortium (W3C);
41it provides a common framework for encoding and communicating structured and semi-structured information. 
42XML can represent virtually any type of information (i.e., content) in a descriptive fashion.
43XML markup encodes a description of an XML document's storage layout and logical structure.
44Since XML is intended to be human-readable, markup tags are often verbose by design \cite{TR:XML}.
45For example, Figure \ref{fig:sample_xml} provides a standard product list encapsulated within an XML document.
46All content is highlighted in bold. Anything that is not content is considered markup.
53\hspace*{1em} \= \hspace*{1em} \= \hspace*{1em} \= \hspace*{1em} \= \hspace*{1em} \= \\
54{\ttfamily{\verb`<`Products\verb`>`}}\+ \\
55{\ttfamily{\verb`<`Product ID=\verb`"`}}{\bfseries\ttfamily{0001}}{\ttfamily{\verb`"`\verb`>`}}\+ \\
56{\ttfamily{\verb`<`ProductName Lang=\verb`"`}}{\bfseries\ttfamily{English}}{\ttfamily{\verb`"`\verb`>`}}{\bfseries\ttfamily{Widget}}{\ttfamily{\verb`<`/ProductName\verb`>` }}\\
57{\ttfamily{\verb`<`ProductName Lang=\verb`"`}}{\bfseries\ttfamily{French}}{\ttfamily{\verb`"`\verb`>`}}{\bfseries\ttfamily{Bitoniau}}{\ttfamily{\verb`<`/ProductName\verb`>` }}\\
58{\ttfamily{\verb`<`Company\verb`>`}}{\bfseries\ttfamily{ABC}}{\ttfamily{\verb`<`/Company\verb`>` }}\\
59{\ttfamily{\verb`<`Price\verb`>`}}{\bfseries\ttfamily{\$19.95}}{\ttfamily{\verb`<`/Price\verb`>` }}\- \\
60{\ttfamily{\verb`<`/Product\verb`>` }}\- \\
61{\ttfamily{\verb`<`/Products\verb`>` }}\\
65\caption{Sample XML Document}\label{fig:sample_xml}
70\subsection{XML Parsers}
71% However, textual data tends to consist of variable-length items in generally unpredictable patterns \cite{Cameron2010}.
72Traditional XML parsers process an XML document sequentially, a single
73byte-at-a-time, from the first to the last character in the source
74text.  Each character is examined to distinguish between the
75XML-specific markup, such as a left angle bracket `\verb`<`', and the
76content held within the document.  A logical {\it cursor} position denotes the current character under interpretation. As the parser moves its cursor through the document, it alternates
77between markup scanning, validation, and content processing
78operations.  In other words, traditional XML parsers operate as
79finite-state machines that use byte comparisons to transition between
80data and metadata states. Each state transition indicates the context
81for subsequent characters. Unfortunately, textual data tends to
82consist of variable-length items sequenced in generally unpredictable
83patterns; thus any character could be a state transition until deemed
86A major disadvantage of the sequential byte-at-a-time approach to XML
87parsing is that processing an XML document requires at least one
88conditional branch per byte of source text.  For example, Xerces-C,
89which forms the foundation for the widely deployed the Apache XML project
90\cite{xerces}, uses a series of nested switch statements and
91state-dependent flag tests to control the parsing logic of the
92program. Xerces's complex data dependent control flow requires between
936--13 branches per byte of XML input, depending on the markup in
94the file (details in Section~\ref{section:XML-branches}).  Cache
95utilization is also significantly reduced due to the manner in which
96markup and content must be scanned and buffered for future use.  For
97instance, Xerces incurs $\sim$100 L1 cache misses per kilobyte (kB) of
98XML data. In general, while microarchitectural improvements may help the
99parser tide over some of these challenges (e.g., cache misses), the
100fundamental data and control flow in byte-at-a-time parsers are ill suited for
101commodity processors and experience significant overhead.
106% Switch block logical
107% Branches
108% Cache utilization
110% \subsection {Parallel XML Parsing}
113% In general, parallel acceleration methods come in two forms: multithreaded approaches and
114% SIMD-based techniques. Multithreaded XML parsers take advantage of multiple cores via number of strategies.
115% Common strategies include preparsing the XML file to locate key partitioning points \cite{ZhangPanChiu09}
116% and speculative p-DFAs \cite{ZhangPanChiu09}.
118% To accelerate XML parsing performance, most of the recent
119% work has focused on parallelization through the use of multicore
120% parallelism for chip multiprocessors \cite{ZhangPanChiu09, ParaDOM2009, LiWangLiuLi2009},
123% SIMD XML parsers leverage the SIMD registers to overcome the
124% performance limitations of the sequential byte-at-a-time processing model and its inherently data dependent
125% branch misprediction rates.  Further, data parallel SIMD instructions allow the processor to perform the same
126% operation on multiple pieces of data simultaneously.  The Parabix1 and Parabix2 parsers studied in this paper
127% fall under the SIMD classification. The Parabix parser versions studied are described in further detail in
128% Section \ref{section:parabix}.
130% while SIMD (Single Instruction Multiple Data) parallelism
131% has been of interest to Intel in designing new SIMD instructions\cite{XMLSSE42}
132% , as well as to the developers of parallel bit stream technology
133% \cite{CameronHerdyLin2008, Cameron2009, Cameron2010}.
134% Each of these approaches has shown considerable performance
135% benefits over traditional sequential parsing techniques that follow the
136% byte-at-a-time model.
138%\subsection {SIMD Operations}
139% Two such SIMD XML parsers, Parabix1 and Parabix2, utilizes parallel bit stream processing technology.
140% Extract section 2.2 and merge into 3.   Add a new subsection
141% in section 2 saying a bit about SIMD.   Say a bit about pure SIMD vertical
142% operations and then mention the pack operations that allow
143% us to implement transposition efficiently in parallel. 
144% Also note that the SIMD registers support bitwise logic across
145% their full width and that this is extensively used in our work.
146% \subsection{Parallel XML Parsing}
148% Parallel XML processing generally comes in one of two forms: multithreading and SIMD. Multithreaded XML parsers take advantage of parallelism by first quickly preparsing the XML file to locate the key markup entities and determine the best workload distribution in which process the XML file using $n$-cores \cite{ZhangPanChiu09}. SIMD XML parsers leverage the SIMD registers to overcome the performance limitations of the sequential paradigm and inherently data dependent branch misprediction rates \cite{Cameron2010}. Two such SIMD XML parsers, Parabix1 and Parabix2, utilizes parallel bit stream processing technology. With this method, byte-oriented character data is first transposed to eight parallel bit streams, one for each bit position within the character code units (bytes). These bit streams are then loaded into SIMD registers of width $W$ (e.g., 64-bit, 128-bit, 256-bit, etc). This allows $W$ consecutive code units to be represented and processed at once. Bitwise logic and shift operations, bit scans, population counts and other bit-based operations are then used to carry out the work in parallel \cite{CameronLin2009}.
150% \subsubsection{Parabix1}
152% Our first generation parallel bit stream XML parser---Parabix1---uses employs a less conventional approach of SIMD technology to represent text in parallel bit streams. Bits of each stream are in one-to-one-correspondence with the bytes of a character stream. A transposition step first transforms sequential byte stream data into eight basis bit streams for the bits of each byte. Bitwise logical combinations of these basis bit streams can then be used to classify bytes in various ways, while the bit scan operations common to commodity processors can be used for fast sequential scanning. At a high level, Parabix1 processes source XML in a functionally equivalent manner as a traditional processor. That is, Parabix1 moves sequentially through the source document, maintaining a single cursor scanning position throughout the parse. However, this scanning operation itself is accelerated significantly which leads to dramatic performance improvements, since bit scan operations can perform up to general register width (32-bit, 64-bit) finite state transitions per clock cycle. This approach has recently been applied to Unicode transcoding and XML parsing to good effect, with research prototypes showing substantial speed-ups over even the best of byte-at-a-time alternatives \cite{CameronHerdyLin2008, CameronLin2009, Cameron2010}.
154% \subsubsection{Parabix2}
156% In our second generation XML parser---Parabix2---we address the replacement of sequential parsing using bit scan instructions with a parallel parsing method using bitstream addition. Unlike the single cursor approach of Parabix1 and conceptually of traditional sequential approach, in Parabix2 multiple cursors positions are processed in parallel. To deal with these parallel cursors, three additional categories of bit streams are introduced. Marker bit streams are used to represent positions of interest in the parsing of a source data stream \cite{Cameron2010}. The appearance of a 1 at a position in a marker bitstream could, for example, denote the starting position an XML tag in the data stream. In general, the set of bit positions in a marker bitstream may be considered to be the current parsing positions of multiple parses taking place in parallel throughout the source data stream. A further aspect of the parallel method is that conditional branch statements used to identify syntax error at each each parsing position are eliminated. Instead, error bit streams are used to identify the position of parsing or well-formedness errors during the parsing process. Error positions are gathered and processed in as a final post processing step. Hence, Parabix2 offers additional parallelism over Parabix1 in the form of multiple cursor parsing as well as significanlty reduces branch misprediction penalty.
159% --- Not Sure ---
161% The first two parsers employ traditional byte-at-a-time methods of
162% parsing; these parsers were selected based on their popularity in the
163% marketplace and the availability of source code for deeper analysis.
164% The Fluke i410 current clamp is a digital multimeter that reads the
165% magnetic field of a live electrical cable to determine the current
166% passing through it without affecting the underlying hardware.
168% \textbf{No Need to talk about resource usage until  later}
170% In order to determine how and which performance factors influence energy consumption,
171% we intend to use the Fluke i410 current clamp in conjunction with PMCs to compare the per parser invocation and per source XML byte energy % usage of three XML parsers:
172% Expat 2.0.1, Xerces-C++ 3.1.1 (SAX2), and Parabix2. All three parsers are C/C++ based, event-driven, stream-oriented XML parsers.
174% Think the person reading doesn't know much about XML parsers.
177% Need gory details on byte at a time parsers. Pictures.
178% Xerces. Explain overall data flow and control flow for these parsers.
179% Briefly highlight inefficiencies.
183% Talk about the usage of SIMD instructions and how it might help. Lead
184% onto briefly describe the key technology behind parabix.
187% --- Section Advice ---
189% 1. Think the person reading doesn't know much about XML parsers.
191% 2. Need gory details on byte at a time parsers. Pictures.
192% Xerces. Explain overall dataflow and control flow for these parsers.
193% Briefly highlight inefficiencies.
195% 3. Talk about the usage of SIMD instructions and how it might help. Lead
196% onto briefly describe the key technology behind parabix.
Note: See TracBrowser for help on using the repository browser.