Python/XML HOWTO
_________________________________________________________________
A.M. Kuchling
akuchlin@mems-exchange.org
Abstract:
XML is the eXtensible Markup Language, a subset of SGML intended to
allow the creation and processing of application-specific markup
languages. Python makes an excellent language for processing XML data.
This document is a tutorial for the Python/XML package. It assumes
you're already somewhat familiar with the structure and terminology of
XML, though a brief introduction is supplied.
Contents
* 1 Introduction to XML
+ 1.1 Elements, Attributes and Entities
+ 1.2 Well-Formed XML
+ 1.3 DTDs
* 2 XML-Related Standards
* 3 Installing the XML Toolkit
* 4 Package Overview
* 5 SAX: The Simple API for XML
+ 5.1 Starting Out
+ 5.2 Error Handling
+ 5.3 Searching Element Content
+ 5.4 Enabling Namespace Processing
* 6 DOM: The Document Object Model
+ 6.1 Getting A DOM Tree
+ 6.2 Printing The Tree
+ 6.3 Manipulating the Tree
+ 6.4 Creating New Nodes
+ 6.5 Walking Over The Entire Tree
* 7 XPath and XPointer
* 8 Marshalling Into XML
* 9 Acknowledgements
* About this document ...
1 Introduction to XML
XML, the eXtensible Markup Language, is a simplified dialect of SGML,
the Standard Generalized Markup Language. XML is intended to be
reasonably simple to implement and use, and is already being used for
specifying markup languages for various new standards: MathML for
expressing mathematical equations, Synchronized Multimedia Integration
Language for multimedia presentations, and so forth.
SGML and XML represent a document by tagging the document's various
components with their function or meaning. For example, a book
contains several parts: it has a title, one or more authors, the text
of the book, perhaps a preface or an index, and so forth. A markup
language for writing books would therefore have elements indicating
what the contents of the preface are, what the title is, and so forth.
This logical structure should not be confused with the physical
details of how the document is actually printed on paper. The index
might be printed with narrow margins in a smaller font than the rest
of the book, but markup usually isn't (or shouldn't be, anyway)
concerned with details such as this. Instead, other software will
translate from the markup language to a typeset format, handling the
presentation details.
This section will provide a brief overview of XML and a few related
standards, but it's far from being complete because making it complete
would require a full-length book and not a short HOWTO. There's no
better way to get a completely accurate (if rather dry) description
than to read the original W3C Recommendations; you can find links to
them below. If you already know what XML is, you can skip the rest of
this section.
Later sections of this HOWTO assume that you're familiar with XML
terminology. Most sections will use XML terms such as element and
attribute. The section on SAX does not require that you have
experience with any of the various Java SAX implementations.
See Also:
Extensible Markup Language (XML) 1.0 (Second Edition)
For the full details of XML's syntax, the definitive source is
the XML 1.0 specification. However, like all specifications
it's quite formal and isn't intended to be a friendly
introduction or a tutorial. An annotated version of the
standard is also available, and there are many more informal
tutorials and books available to introduce you to XML at
greater (or lesser) length.
The Annotated XML Specification
This annotated version of the XML specification, produced by
Tim Bray, is quite helpful in clarifying the specification's
intent. It is presented as a richly-hyperlinked document that
makes navigation easy, and evokes a sense of what hypertext was
meant to be.
The XML Cover Pages
An extensive collection of links to XML and SGML resources,
including a news page that's updated every few days. If you can
only remember one XML-related URL, remember this one. Cafe con
Leche is another good resource.
xml-dev mailing list
This is a high-traffic list for implementation and development
of XML standards. Be warned: Some people might find the
discussion too focused on vague theorizing about information
representation, and not on inventing new standards and tools or
applying existing standards.
1.1 Elements, Attributes and Entities
A markup language specified using XML looks a lot like HTML; a
document consists of a single element, which contains sub-elements,
which can have further sub-elements inside them. Elements are
indicated by tags in the text. Tags are always inside angle brackets
< >. Elements can either contain content, or they can be empty.
An element can contain content between opening and closing tags, as in
<name>Euryale</name>, which is a name element containing the data
"Euryale". This content may be text data, other XML elements, or a
mixture of both.
Elements can also be empty, containing nothing, and are represented as
a single tag ended with a slash. For example, <stop/> is an empty stop
element. Unlike HTML, XML element names are case-sensitive; stop and
Stop are two different elements.
Opening and empty tags can also contain attributes, which specify
values associated with an element. For example, in the XML text
<name lang="greek">Herakles</name>, the name element has a lang
attribute which has a value of "greek". In
<name lang="latin">Hercules</name>, the attribute's value is "latin".
XML also includes entities as a shorthand for including a particular
character or a longer string. Entity references always begin with a
"&" and end with a ";". For example, a particular Unicode character
can be written as &#4660; using its character code in decimal, or as
&#x1234; using hexadecimal. It's also possible to define your own
entities, making &title; expand to ``The Odyssey'', for example. If
you want to include the "&" character in XML content, it must be
written as &amp;.
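As a rough illustration of elements, attributes and character
references, the sketch below parses a tiny document with
xml.dom.minidom from the Python standard library. Python 3 syntax is
assumed, and the names wrapper element, the chars element and the
text_of helper are invented for this example.

```python
from xml.dom import minidom

# Illustrative document; the <names> wrapper and <chars> element are
# invented for this sketch.
SAMPLE = (
    '<names>'
    '<name lang="greek">Herakles</name>'
    '<name lang="latin">Hercules</name>'
    '<stop/>'
    '<chars>&#4660;&#x1234;&amp;</chars>'
    '</names>'
)

def text_of(node):
    """Concatenate the text nodes directly inside an element."""
    return "".join(child.data for child in node.childNodes
                   if child.nodeType == child.TEXT_NODE)

doc = minidom.parseString(SAMPLE)
names = doc.getElementsByTagName("name")

# Attribute values and element content
langs = [n.getAttribute("lang") for n in names]
texts = [text_of(n) for n in names]

# An empty element has no child nodes at all
stop_is_empty = not doc.getElementsByTagName("stop")[0].hasChildNodes()

# Decimal and hexadecimal references name the same character
chars = text_of(doc.getElementsByTagName("chars")[0])

print(langs, texts, stop_is_empty, [hex(ord(c)) for c in chars])
```

Both character references come back as the single character U+1234,
and the &amp; reference comes back as a plain "&".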
1.2 Well-Formed XML
A legal XML document must, as a minimum, be well-formed: each opening
tag must have a corresponding closing tag, and tags must nest
properly. For example, <b><i>text</b></i> is not well-formed because
the i element should be enclosed inside the b element, but instead the
closing </b> tag is encountered first. This example can be made
well-formed by swapping the order of the closing tags, resulting in
<b><i>text</i></b>.
If you've ever written HTML by hand, you may have acquired the habit
of being a bit sloppy about this. Strictly speaking, HTML has exactly
the same rules about nesting tags as XML, but most Web browsers are
very forgiving of errors in HTML. This is convenient for HTML authors,
but it makes it difficult to write programs to parse HTML input
because the programs have to cope with all sorts of malformed input.
The authors of the XML specification didn't want XML to fall into the
same trap, because it would make XML processing software much harder
to write. Therefore, all XML parsers have to be strict and must report
an error if their input isn't well-formed. The Expat parser includes
an executable program named xmlwf that parses the contents of files
and reports any well-formedness violations; it's very handy for
checking XML data that's been output from a program or written by
hand.
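A similar check can be sketched in a few lines of Python with the
xml.parsers.expat module from the standard library. This is only an
approximation of what xmlwf does, written for Python 3; the
is_well_formed function name is made up for the example.

```python
from xml.parsers import expat

def is_well_formed(text):
    """Return True if text parses as well-formed XML, False otherwise."""
    parser = expat.ParserCreate()
    try:
        # The second argument marks this as the final chunk of input.
        parser.Parse(text, True)
    except expat.ExpatError:
        return False
    return True

print(is_well_formed("<b><i>text</b></i>"))   # prints False
print(is_well_formed("<b><i>text</i></b>"))   # prints True
```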
1.3 DTDs
Well-formedness just says that all tags nest properly and that every
opening tag is matched by a closing tag. It says nothing about the
order of elements or about which elements can be contained inside
other elements.
The following XML, apparently representing a book, is well-formed but
it doesn't match the structure expected for a book:
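<book>
  <index>...</index>
  <chapter>...</chapter>
  <chapter>...</chapter>
  <abstract>...</abstract>
  <chapter>...</chapter>
  <preface>...</preface>
</book>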
Prefaces don't come at the end of books, the index doesn't belong at
the front, and the abstract doesn't belong in the middle.
Well-formedness alone doesn't provide any way of enforcing that order.
You could write a Python program that took an XML file like this and
checked whether all the parts are in order, but then someone wanting
to understand what documents are legal would have to read your
program.
Document Type Definitions, or DTDs for short, are a more concise way
of enforcing ordering and nesting rules. A DTD declares the element
names that are allowed, and how elements can be nested inside each
other. To take an example from HTML, the LI element, representing an
entry in a list, can only occur inside certain elements which
represent lists, such as OL or UL. The DTD also specifies the
attributes that can be provided for each element, the default value
for each attribute, and whether the attribute can be omitted. A
validating parser can take a document and a DTD, and check whether the
document is legal according to the DTD's rules. (The PyXML package
includes a validating parser called xmlproc.)
DTDs are therefore an example of a schema language, a language for
specifying a set of legal XML documents. Other applications want even
stricter control over which documents are legal, and there are
therefore stricter schema languages. XML Schema provides a type system
and a number of basic types, so you can say that the value of an
attribute must be a number or a date. RELAX NG is another schema
language that provides more power and flexibility than XML Schema, but
is simpler to read and implement.
Note that it's quite possible to get useful work done without using
any schema language at all. You might decide that just writing
well-formed XML and checking it with a Python program is all you need.
There's no reason to drag in a schema language if it won't be useful.
Let's return to DTDs. A DTD lists the supported elements, the order in
which elements must occur, and the possible attributes for each
element. Here's a fragment from an imaginary DTD for writing books:
The first line declares the book element, and specifies the elements
that can occur inside it and the order in which the subelements must
be provided. DTDs borrow from regular expression notation in order to
express how elements can be repeated; "?" means an element must occur 0
or 1 times, "*" is 0 or more times, and "+" means the element must
occur 1 or more times. For example, the above declarations imply that
the abstract and appendix elements are optional inside a book element.
Exactly one preface element has to be present, and it can be followed
by any number of chapter elements; having no chapters at all would be
legal.
The ATTLIST declaration specifies attributes for the chapter element.
Chapters can have two attributes, id and title. title contains
character data (CDATA) and is optional (that's what "#IMPLIED" means,
for obscure historical reasons). id must contain an ID value, and it's
required.
A validating parser could take this DTD and a sample document, and
report whether the document is valid according to the rules of the
DTD. A document is valid if all the elements occur in the right order,
and in the right number of repetitions.
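The declarations described above can be sketched as follows. The DTD
text here is a reconstruction from the prose description, not
necessarily the original fragment, and it is embedded in a document's
internal subset so that the standard library's expat bindings can read
it (Python 3 assumed). Note that expat is not a validating parser: this
sketch only lists what the DTD declares and does not check the document
against it.

```python
from xml.parsers import expat

# Hypothetical DTD fragment reconstructed from the description in the
# text: optional abstract and appendix, exactly one preface, any
# number of chapters, and two attributes on chapter.
DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE book [
  <!ELEMENT book (abstract?, preface, chapter*, appendix?)>
  <!ATTLIST chapter
            id    ID    #REQUIRED
            title CDATA #IMPLIED>
]>
<book><preface/></book>
"""

declared_elements = []
declared_attributes = {}

def element_decl(name, model):
    # model describes the content model; only the name is kept here.
    declared_elements.append(name)

def attlist_decl(elname, attname, type_, default, required):
    # required is true for #REQUIRED and false for #IMPLIED.
    declared_attributes[(elname, attname)] = (type_, bool(required))

parser = expat.ParserCreate()
parser.ElementDeclHandler = element_decl
parser.AttlistDeclHandler = attlist_decl
parser.Parse(DOCUMENT, True)

print(declared_elements)
print(declared_attributes)
```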
2 XML-Related Standards
XML 1.0 is the basic standard, but people have built many, many
additional standards and tools on top of XML or to be used with XML.
This section will quickly introduce some of these related
technologies, paying particular attention to those that are supported
by the Python/XML package.
SAX
The Simple API for XML isn't a standard in the formal sense
that XML or ANSI C are. Rather, SAX is an informal
specification originally designed by David Megginson with input
from many people on the xml-dev mailing list. SAX defines an
event-driven interface for parsing XML. To use SAX, you must
create Python class instances which implement a specified
interface, and the parser will then call various methods on
those objects. See section 5.
DOM
The Document Object Model specifies a tree-based representation
for an XML document, as opposed to the event-driven processing
provided by SAX. See section 6.
Namespaces
One XML document can refer to elements from more than one DTD.
(Such documents can no longer be validated using DTDs, though
other schema languages such as RELAX NG can handle namespaces.)
For example, a document might contain both some text and a
diagram. The text might be represented using some elements from
the HTML DTD, and the diagram might use elements from the
Scalable Vector Graphics DTD. All the relevant modules in the
PyXML package can be used for namespace-aware processing.
XPath and XPointer
XPath is a language for referring to parts of an XML document.
With XPath you can refer to paragraph number N, or ``all
paragraphs of class "warning"'', or all chapters that have one
or more subsections. XPointer defines a way to use XPath
declarations as the fragment identifier in a URL to point at a
part of an XML document. See section 7.
XSLT
XSLT is a general tool for transforming one XML document into
another document, specifying the transformation using another
XML document called a stylesheet.
RDF
The Resource Description Framework is for describing metadata
about other resources. The PyXML package doesn't contain any
support for RDF, but a Python library called Redfoot
(http://redfoot.sf.net) is available.
3 Installing the XML Toolkit
Releases are available from http://sourceforge.net/projects/pyxml/.
Windows users should download the appropriate precompiled version.
Linux users can either download an RPM, or install from source. Users
on other platforms have no choice but to install from source.
To compile from source on a Unix platform, simply perform the
following steps.
1. Download the latest version of the source distribution from
http://sourceforge.net/projects/pyxml. Unpack it with the
following command.
gzip -dc xml-package.tgz | tar -xvf -
2. Run python setup.py install. In order to run this, you'll need to
have a C compiler installed, and it should be the same one that
was used to build your Python installation. On a Unix system, this
operation may require superuser permissions. setup.py supports a
number of different commands and options; invoke setup.py without
any arguments to see a help message.
If you have difficulty installing this software, send a problem report
to the XML-SIG mailing list describing the problem, or submit a bug
report at http://sourceforge.net/projects/pyxml.
One possible problem that some people encounter is a general issue of
managing a Python installation with 3rd-party compiled extensions: If,
when importing any of the C extensions provided with PyXML, you get an
error message saying "undefined symbol: PyUnicodeUCS2_"..., then you
are using a version of Python built using a 4-byte representation for
Unicode characters, and PyXML was built with a Python that used a
2-byte Unicode character. Conversely, if the error message gives a
symbol name starting with PyUnicodeUCS4_ (note the different digit
near the end), the extension was built using a 4-byte Unicode
character, and Python was built using a 2-byte Unicode character. The
Python interpreter and all extension code need to be built using the
same size Unicode character representation.
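One way to find out which representation a given interpreter uses is to
look at sys.maxunicode, which is 65535 on a 2-byte build and 1114111 on
a 4-byte build. The helper below is a small sketch; the unicode_build
name is made up, and on Python 3.3 and later the answer is always the
4-byte value because the distinction no longer exists.

```python
import sys

def unicode_build(maxunicode=sys.maxunicode):
    """Classify an interpreter as a 2-byte (UCS2) or 4-byte (UCS4) build."""
    return "UCS2" if maxunicode == 0xFFFF else "UCS4"

# Report the build of the interpreter running this script.
print(unicode_build())
```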
There are various demonstration programs in the demo/ directory of the
Python/XML source distribution. You may wish to look at them to get an
idea of what's possible with the XML tools, and as a source of example
code.
See Also:
Python/XML Topic Guide
This Guide is the starting point for Python-related XML topics,
and includes links to software, mailing lists, documentation,
and other useful resources.
4 Package Overview
The PyXML package contains over 200 individual modules, some intended
for public use and some not. Many of these modules perform similar
tasks, which makes it difficult to figure out which is the right one
to use in a given situation.
Here's a list of the 30-odd packages and modules that are considered
public, along with brief descriptions to help you choose the right
one.
xml.dom
The Python DOM interface. The full interface supports DOM Levels
1 and 2. xml.dom contains the implementation for DOM trees
built from XML documents. (This implementation is called 4DOM,
and was written by Fourthought Inc.)
xml.dom.html
DOM trees built from HTML documents are also supported.
xml.dom.javadom
An adaptor for using Java DOM implementations with Jython.
xml.dom.minidom
A lightweight DOM implementation that's also included in the
Python standard library.
xml.dom.minitraversal
Offers traversal and ranges on top of xml.dom.minidom, using
the 4DOM traversal implementation.
xml.dom.pulldom
Provides a stream of DOM elements. This module can make it easy
to write certain types of DTD-specific processing code.
xml.dom.xmlbuilder
General support for the experimental Document Object Model
(DOM) Level 3 Load and Save Specification. This currently only
supports the xml.dom.minidom DOM implementation.
xml.dom.ext
Various DOM-related extensions for pretty-printing DOM trees as
XML or XHTML.
xml.dom.ext.Dom2Sax
A parser to generate SAX events from a DOM tree.
xml.dom.ext.c14n
Takes a DOM tree and outputs a text stream containing the
Canonical XML representation of the document.
xml.dom.ext.reader
Classes for building DOM trees from various input sources: SAX1
and SAX2 parsers, htmllib, and directly using Expat.
xml.marshal.generic
Marshals simple Python data types into an XML format. The
Marshaller and Unmarshaller classes can be subclassed in order
to implement marshalling into a different XML DTD.
xml.marshal.wddx
Marshals Python objects into WDDX. (This module is built on top
of the preceding generic module.)
xml.ns
Contains constants for the namespace URIs for various
XML-related standards.
xml.parsers.sgmllib
A version of the sgmllib module that's part of the standard
Python library, rewritten to run on top of the sgmlop
accelerator module.
xml.parsers.xmlproc
A validating XML parser. Usually you'll want to use xmlproc via
SAX or some other higher-level interface.
xml.sax
SAX1 and SAX2 support for Python.
xml.sax.drivers
SAX1 drivers for various parsers: htmllib, LT, Expat, sgmllib,
xmllib, xmlproc, and XML-Toolkit.
xml.sax.drivers2
SAX2 drivers for various parsers: htmllib, Java SAX parsers
(for Jython), Expat, sgmllib, xmlproc.
xml.sax.handler
Contains the core SAX2 handler classes ContentHandler,
DTDHandler, EntityResolver, and ErrorHandler. Also contains
symbolic names for the various SAX2 features and properties.
xml.sax.sax2exts
SAX2 extensions. This contains various factory classes that
create parser objects, and is how SAX2 parsers are used.
xml.sax.saxlib
Contains two SAX2 handler classes, DeclHandler and
LexicalHandler, and the XMLFilter interface. Also contains the
deprecated SAX1 handler classes.
xml.sax.saxutils
Various utility classes, such as DefaultHandler, a default base
class for SAX2 handlers, ErrorPrinter and ErrorRaiser, two
default error handlers, and XMLGenerator, which generates XML
output from a SAX2 event stream.
xml.sax.xmlreader
Contains the XMLReader, the base interface for implementing
SAX2 parsers.
xml.schema.trex
A Python implementation of TREX, a schema language.
xml.utils.characters
Contains the legal XML character ranges as specified in the XML
1.0 Recommendation, and regular expressions that match various
XML tokens.
xml.utils.iso8601
Parses ISO-8601 date/time specifiers, which look like
"2002-05-09T20:40Z".
xml.utils.qp_xml
A simple tree-based XML parsing interface.
xml.xpath
An XPath parser and evaluator. (This implementation is called
4XPath, and was written by Fourthought Inc.)
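To give a feel for one of the less familiar entries, here is a hedged
sketch of xml.dom.pulldom, using the copy that ships with the Python
standard library (Python 3 assumed; the PyXML version may differ in
detail). The sample data is illustrative. The stream is scanned event
by event, and a DOM subtree is built only for the elements of interest.

```python
from xml.dom import pulldom

# Illustrative sample data for this sketch.
XML = (
    '<collection>'
    '<comic title="Sandman" number="62"><writer>Neil Gaiman</writer></comic>'
    '<comic title="Shade" number="1"><writer>Peter Milligan</writer></comic>'
    '</collection>'
)

found = []
events = pulldom.parseString(XML)
for event, node in events:
    if event == pulldom.START_ELEMENT and node.tagName == "comic":
        # Build the DOM subtree for this comic element only.
        events.expandNode(node)
        writer = node.getElementsByTagName("writer")[0]
        name = "".join(child.data for child in writer.childNodes)
        found.append((node.getAttribute("title"), name))

print(found)
```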
5 SAX: The Simple API for XML
This HOWTO describes version 2 of SAX (also referred to as SAX2).
Support is still present for SAX version 1, which is now only of
historical interest; SAX1 will not be documented here.
SAX is most suitable for purposes where you want to read through an
entire XML document from beginning to end, and perform some
computation such as building a data structure or summarizing the
contained information (computing an average value of a certain
element, for example). SAX is not very convenient if you want to
modify the document structure by changing how elements are nested,
though it would be straightforward to write a SAX program that simply
changed element contents or attributes. For example, you wouldn't want
to re-order chapters in a book using SAX, but you might want to
extract the contents of all name elements with the attribute lang set
to 'greek'.
One advantage of SAX is speed and simplicity. Let's say you've defined
a complicated DTD for listing comic books, and you wish to scan
through your collection and list everything written by Neil Gaiman.
For this specialized task, there's no need to expend effort examining
elements for artists and editors and colourists, because they're
irrelevant to the search. You can therefore write a handler class
that ignores all elements that aren't writer.
Another advantage of SAX is that you don't have the whole document
resident in memory at any one time, which matters if you are
processing really huge documents.
SAX defines 4 basic interfaces. A SAX-compliant XML parser can be
passed any objects that support these interfaces, and will call
various methods as data is processed. Your task, therefore, is to
implement those interfaces that are relevant to your application.
The SAX interfaces are:
Interface        Purpose
ContentHandler   Called for general document events. This interface is
                 the heart of SAX; its methods are called for the start
                 of the document, the start and end of elements, and
                 for the characters of data contained inside elements.
DTDHandler       Called to handle DTD events required for basic
                 parsing. This means notation declarations (XML spec
                 section 4.7) and unparsed entity declarations (XML
                 spec section 4).
EntityResolver   Called to resolve references to external entities. If
                 your documents will have no external entity
                 references, you don't need to implement this
                 interface.
ErrorHandler     Called for error handling. The parser will call
                 methods from this interface to report all warnings
                 and errors.
Python doesn't support the concept of interfaces, so the interfaces
listed above are implemented as Python classes. The default method
implementations are defined to do nothing--the method body is just a
Python pass statement--so usually you can simply ignore methods that
aren't relevant to your application.
Pseudo-code for using SAX looks something like this:
# Define your specialized handler classes
import sys
from xml.sax import ContentHandler, ...
class docHandler(ContentHandler):
    ...
# Create an instance of the handler classes
dh = docHandler()
# Create an XML parser
parser = ...
# Tell the parser to use your handler instance
parser.setContentHandler(dh)
# Parse the file; your handler's methods will get called
parser.parse(sys.stdin)
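A runnable counterpart to the pseudo-code might look like the sketch
below. It targets Python 3 and the xml.sax package from the standard
library rather than PyXML, which is an assumption, and the DocHandler
class and the inline sample document are invented for illustration.

```python
import io
from xml.sax import make_parser
from xml.sax.handler import ContentHandler

class DocHandler(ContentHandler):
    """Record the start and end of every element."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

# Create an instance of the handler class
handler = DocHandler()
# Create an XML parser
parser = make_parser()
# Tell the parser to use the handler instance
parser.setContentHandler(handler)
# Parse an in-memory document; the handler's methods get called
parser.parse(io.BytesIO(b"<doc><p>hi</p></doc>"))

print(handler.events)
```

Methods that are not overridden, such as characters(), keep their
do-nothing defaults.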
See Also:
The SAX Home Page
This website has the most recent copy of the specification, and
lists SAX implementations for various languages and platforms.
Much of the information is somewhat Java-centric, though.
5.1 Starting Out
Let's follow the earlier example of a comic book collection, using a
simple DTD-less format. Here's a sample document for a collection
consisting of a single issue:
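<collection>
  <comic title="Sandman" number="62">
    <writer>Neil Gaiman</writer>
    <penciller>Glyn Dillon</penciller>
    <penciller>Charles Vess</penciller>
  </comic>
</collection>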
An XML document must have a single root element; this is the
"collection" element. It has one child comic element for each issue;
the book's title and number are given as attributes of the comic
element. The comic element can in turn contain several other elements
such as writer and penciller listing the writer and artists
responsible for the issue. There may be several artists or writers for
a single issue.
Let's start off with something simple: a document handler named
FindIssue that reports whether a given issue is in the collection.
from xml.sax import saxutils
class FindIssue(saxutils.DefaultHandler):
    def __init__(self, title, number):
        self.search_title, self.search_number = title, number
The DefaultHandler class inherits from all four interfaces:
ContentHandler, DTDHandler, EntityResolver, and ErrorHandler. This is
what you should use if you want to just write a single class that
wraps up all the logic for your parsing. You could also subclass each
interface individually and implement separate classes for each
purpose. Neither of the two approaches is always ``better'' than the
other; mostly it's a matter of taste.
Since this class is doing a search, an instance needs to know what
it's searching for. The desired title and issue number are passed to
the FindIssue constructor, and stored as part of the instance.
Now let's override some of the parsing methods. This simple search
only requires looking at the attributes of a given element, so only
the startElement method is relevant.
    def startElement(self, name, attrs):
        # If it's not a comic element, ignore it
        if name != 'comic': return
        # Look for the title and number attributes (see text)
        title = attrs.get('title', None)
        number = attrs.get('number', None)
        if (title == self.search_title and
            number == self.search_number):
            print title, '#' + str(number), 'found'
The startElement() method is passed a string giving the name of the
element, and an instance containing the element's attributes.
Attributes are accessed using methods from the AttributeList
interface, which includes most of the semantics of Python
dictionaries.
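The dictionary-like behaviour can be tried out without running a
parser. The sketch below builds an attributes object by hand using
AttributesImpl from the standard library's xml.sax.xmlreader module
(Python 3 assumed); in real use the parser creates this object and
passes it to startElement().

```python
from xml.sax.xmlreader import AttributesImpl

# Normally the parser builds this object; here it is built by hand.
attrs = AttributesImpl({"title": "Sandman", "number": "62"})

title = attrs.get("title", None)      # value of an existing attribute
missing = attrs.get("price", None)    # None, since there is no price
has_number = "number" in attrs        # membership test, as with a dict
names = sorted(attrs.keys())          # attribute names

print(title, missing, has_number, names, len(attrs))
```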
To summarize, the startElement() method looks for comic elements and
compares the specified title and number attributes to the search
values. If they match, a message is printed out.
startElement() is called for every single element in the document. If
you added print 'Starting element:', name to the top of
startElement(), you would get the following output.
Starting element: collection
Starting element: comic
Starting element: writer
Starting element: penciller
Starting element: penciller
To actually use the class, we need top-level code that creates
instances of a parser and of FindIssue, associates the parser and the
handler, and then calls a parser method to process the input.
import sys
from xml.sax import make_parser
from xml.sax.handler import feature_namespaces
if __name__ == '__main__':
    # Create a parser
    parser = make_parser()
    # Tell the parser we are not interested in XML namespaces
    parser.setFeature(feature_namespaces, 0)
    # Create the handler
    dh = FindIssue('Sandman', '62')
    # Tell the parser to use our handler
    parser.setContentHandler(dh)
    # Parse the input (read here from standard input)
    parser.parse(sys.stdin)
The make_parser function can automate the job of creating parsers. There
are already several XML parsers available to Python, and more might be
added in future. xmllib.py is included as part of the Python standard
library, so it's always available, but it's also not particularly
fast. A faster version of xmllib.py is included in xml.parsers. The
xml.parsers.expat module is faster still, so it's obviously a
preferred choice if it's available. make_parser determines which
parsers are available and chooses the fastest one, so you don't have
to know what the different parsers are, or how they differ. (You can
also tell make_parser to try a list of parsers, if you want to use a
specific one).
Once you've created a parser instance, calling the setContentHandler()
method tells the parser what to use as the content handler. There are
similar methods for setting the other handlers: setDTDHandler(),
setEntityResolver(), and setErrorHandler().
If you run the above code with the sample XML document, it'll print
Sandman #62 found.
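For readers who want to try this with a current interpreter, the pieces
can be assembled as in the following sketch. It assumes Python 3 and
the standard library's xml.sax, so it subclasses ContentHandler instead
of saxutils.DefaultHandler, and it collects matches in a list (the
found attribute is an addition for this example) before printing them.

```python
import io
from xml.sax import make_parser
from xml.sax.handler import ContentHandler, feature_namespaces

SAMPLE = b"""<collection>
  <comic title="Sandman" number="62">
    <writer>Neil Gaiman</writer>
    <penciller>Glyn Dillon</penciller>
    <penciller>Charles Vess</penciller>
  </comic>
</collection>
"""

class FindIssue(ContentHandler):
    def __init__(self, title, number):
        super().__init__()
        self.search_title, self.search_number = title, number
        self.found = []

    def startElement(self, name, attrs):
        # Only comic elements carry the attributes of interest
        if name != "comic":
            return
        title = attrs.get("title")
        number = attrs.get("number")
        if title == self.search_title and number == self.search_number:
            self.found.append((title, number))

parser = make_parser()
parser.setFeature(feature_namespaces, False)
handler = FindIssue("Sandman", "62")
parser.setContentHandler(handler)
parser.parse(io.BytesIO(SAMPLE))

for title, number in handler.found:
    print(title, "#" + number, "found")
```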
5.2 Error Handling
Now, try running the above code with this file as input:
<collection>
  &foo;
  <comic title="Sandman" number="62">
</collection>
The &foo; entity is unknown, and the comic element isn't closed (if it
were empty, there would be a "/" before the closing ">"). As a result,
you get a SAXParseException, e.g.
xml.sax._exceptions.SAXParseException: undefined entity at None:2:2
The default code for the ErrorHandler interface automatically raises
an exception for any error; if that is what you want, you don't need
to implement an error handler class at all. Otherwise, you can provide
your own version of the ErrorHandler interface, at minimum overriding
the error() and fatalError() methods. The minimal implementation for
each method can be a single line. The methods in the ErrorHandler
interface--warning(), error(), and fatalError()--are all passed a
single argument, an exception instance. The exception will always be a
subclass of SAXException, and calling str() on it will produce a
readable error message explaining the problem.
For example, if you just want to continue running if a recoverable
error occurs, simply define the error() method to print the exception
it's passed:
    def error(self, exception):
        import sys
        sys.stderr.write("%s\n" % exception)
With this definition, non-fatal errors will result in an error
message, whereas fatal errors will continue to produce a traceback.
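The following sketch shows such an error handler at work, again
assuming Python 3 and the standard library's xml.sax. The
LoggingErrorHandler class is invented for this example: it records
recoverable errors and re-raises fatal ones, and the undefined entity
in the input is reported as a fatal error.

```python
import io
from xml.sax import make_parser, SAXParseException
from xml.sax.handler import ContentHandler, ErrorHandler

BROKEN = b"""<collection>
  &foo;
  <comic title="Sandman" number="62">
</collection>
"""

class LoggingErrorHandler(ErrorHandler):
    def __init__(self):
        self.messages = []

    def error(self, exception):
        # Recoverable error: remember the message and carry on
        self.messages.append(str(exception))

    def fatalError(self, exception):
        # Fatal error: parsing cannot continue
        raise exception

parser = make_parser()
parser.setContentHandler(ContentHandler())
parser.setErrorHandler(LoggingErrorHandler())

caught = None
try:
    parser.parse(io.BytesIO(BROKEN))
except SAXParseException as exc:
    caught = exc

# str() on the exception gives a readable message with the position
print(caught)
```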
5.3 Searching Element Content
Let's tackle a slightly more complicated task: printing out all issues
written by a certain author. This now requires looking at element
content, because the writer's name is inside a writer element:
<writer>Peter Milligan</writer>.
The search will be performed using the following algorithm:
1. The startElement method will be more complicated. For comic
elements, the handler has to save the title and number, in case
this comic is later found to match the search criterion. For
writer elements, it sets an inWriterContent flag to true, and sets
a writerName attribute to the empty string.
2. Characters outside of XML tags must be processed. When
inWriterContent is true, these characters must be added to the
writerName string.
3. When the writer element is finished, we've now collected all of
the element's content in the writerName attribute, so we can check
if the name matches the one we're searching for, and if so, print
the information about this comic. We must also set inWriterContent
back to false.
Here's the first part of the code; this implements step 1.
from xml.sax import ContentHandler
def normalize_whitespace(text):
    "Remove redundant whitespace from a string"
    return ' '.join(text.split())
class FindWriter(ContentHandler):
    def __init__(self, search_name):
        # Save the name we're looking for
        self.search_name = normalize_whitespace(search_name)
        # Initialize the flag to false
        self.inWriterContent = 0
    def startElement(self, name, attrs):
        # If it's a comic element, save the title and issue
        if name == 'comic':
            title = normalize_whitespace(attrs.get('title', ""))
            number = normalize_whitespace(attrs.get('number', ""))
            self.this_title = title
            self.this_number = number
        # If it's the start of a writer element, set flag
        elif name == 'writer':
            self.inWriterContent = 1
            self.writerName = ""
The startElement() method has been discussed previously. Now we have
to look at how the content of elements is processed.
The normalize_whitespace() function is important, and you'll probably
use it in your own code. XML treats whitespace very flexibly; you can
include extra spaces or newlines wherever you like. This means that
you must normalize the whitespace before comparing attribute values or
element content; otherwise the comparison might produce an incorrect
result due to the content of two elements having different amounts of
whitespace.
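A quick check of the function, with made-up input strings, shows why
the normalization matters: two spellings of the same name that differ
only in whitespace compare equal afterwards.

```python
def normalize_whitespace(text):
    "Remove redundant whitespace from a string"
    return " ".join(text.split())

plain = "Peter Milligan"
messy = "  Peter \n   Milligan\t"

print(plain == messy)                                              # prints False
print(normalize_whitespace(plain) == normalize_whitespace(messy))  # prints True
```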
    def characters(self, ch):
        if self.inWriterContent:
            self.writerName = self.writerName + ch
The characters() method is called for characters that aren't inside
XML tags. ch is a string of characters. It is not necessarily a byte
string; parsers may also provide a buffer object that is a slice of
the full document, or they may pass Unicode objects.
You also shouldn't assume that all the characters are passed in a
single function call. In the example above, there might be only one
call to characters() for the string "Peter Milligan", or it might call
characters() once for each character. Another, more realistic example:
if the content contains an entity reference, as in "Wagner &amp;
Seagle", the parser might call the method three times; once for
"Wagner ", once for "&", represented by the entity reference, and
again for " Seagle".
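To see how a particular parser splits the text, the chunks can simply
be recorded. This sketch assumes Python 3 and the standard library's
xml.sax; the ChunkRecorder class is made up for the example. How many
chunks appear depends on the parser, so the only thing to rely on is
that joining them gives back the full content.

```python
import xml.sax
from xml.sax.handler import ContentHandler

class ChunkRecorder(ContentHandler):
    """Remember every string passed to characters()."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)

recorder = ChunkRecorder()
xml.sax.parseString(b"<writer>Wagner &amp; Seagle</writer>", recorder)

print(recorder.chunks)            # the pieces, as delivered by the parser
print("".join(recorder.chunks))   # the complete element content
```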
For step 2 of the algorithm, characters() only has to check
inWriterContent, and if it's true, add the characters to the string
being built up.
Finally, when the writer element ends, the entire name has been
collected, so we can compare it to the name we're searching for.
def endElement(self, name):
    if name == 'writer':
        self.inWriterContent = 0
        self.writerName = normalize_whitespace(self.writerName)
        if self.search_name == self.writerName:
            print 'Found:', self.this_title, self.this_number
To avoid being confused by differing whitespace, the
normalize_whitespace() function is called. This can be done because we
know that leading and trailing whitespace are insignificant for this
application.
End tags can't have attributes on them, so there's no attrs parameter
to the endElement() method. An empty element, even one that carries
attributes, will result in a call to startElement(), followed
immediately by a call to endElement().
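Putting the three methods together, the search might look like the
following sketch. It is written for the xml.sax module in the current
standard library and Python 3, so it differs in small ways from the
fragments above; the class name FindWriter, the found list, and the
sample data are all made up for illustration.

```python
import xml.sax
from xml.sax.handler import ContentHandler


def normalize_whitespace(text):
    """Collapse runs of whitespace and strip both ends."""
    return " ".join(text.split())


class FindWriter(ContentHandler):
    def __init__(self, search_name):
        super().__init__()
        self.search_name = normalize_whitespace(search_name)
        self.inWriterContent = False
        self.writerName = ""
        self.found = []  # (title, number) pairs, collected instead of printed

    def startElement(self, name, attrs):
        if name == "comic":
            self.this_title = normalize_whitespace(attrs.get("title", ""))
            self.this_number = normalize_whitespace(attrs.get("number", ""))
        elif name == "writer":
            self.inWriterContent = True
            self.writerName = ""

    def characters(self, ch):
        # Step 2: accumulate text only while inside a writer element.
        if self.inWriterContent:
            self.writerName += ch

    def endElement(self, name):
        if name == "writer":
            self.inWriterContent = False
            self.writerName = normalize_whitespace(self.writerName)
            if self.search_name == self.writerName:
                self.found.append((self.this_title, self.this_number))


# Made-up sample data, for illustration only.
SAMPLE = b"""<collection>
  <comic title="Sandman" number="62">
    <writer>Neil Gaiman</writer>
  </comic>
  <comic title="Shade, the Changing Man" number="7">
    <writer>Peter
        Milligan</writer>
  </comic>
</collection>"""

handler = FindWriter("Peter Milligan")
xml.sax.parseString(SAMPLE, handler)
print(handler.found)
```

Note that the writer's name in the sample is split across two lines;
the match succeeds only because both sides are normalized first.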
5.4 Enabling Namespace Processing
SAX2 supports XML namespaces. If namespace processing is active,
parsers won't call startElement(), but instead will call a method
named startElementNS(). The default of this setting varies from parser
to parser, so you should always set it to a safe value (unless your
handler supports both namespace-aware and -unaware processing).
For example, our FindIssue content handler described in the previous
section doesn't implement the namespace-aware methods, so we should
request that namespace processing be deactivated before beginning to
parse XML:
from xml.sax import make_parser
from xml.sax.handler import feature_namespaces
# Create a parser
parser = make_parser()
# Disable namespace processing
parser.setFeature(feature_namespaces, 0)
The second argument to setFeature() is the desired state of the
feature, most commonly a Boolean. You would call
parser.setFeature(feature_namespaces, 1) to enable namespace
processing.
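If you want to confirm which state a parser ended up in, you can read
the feature back with getFeature(). This is a small sketch against
the standard library's xml.sax module; other SAX drivers may report
the value differently, so the code only looks at its truth value.

```python
from xml.sax import make_parser
from xml.sax.handler import feature_namespaces

parser = make_parser()

# Turn namespace processing off, then read the setting back.
parser.setFeature(feature_namespaces, 0)
disabled = parser.getFeature(feature_namespaces)

# Turn it on and read it back again.
parser.setFeature(feature_namespaces, 1)
enabled = parser.getFeature(feature_namespaces)

print(bool(disabled), bool(enabled))
```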
Namespaces in XML work by first defining a namespace prefix that maps
to a given URI specified by the relevant DTD, and then using that
prefix to mark elements and attributes that come from that DTD. For
example, the XLink specification says that the namespace URI is
"http://www.w3.org/1999/xlink". The following XML snippet includes
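some XLink attributes:
<root xmlns:xlink="http://www.w3.org/1999/xlink">
  <elem xlink:href="http://www.python.org" />
</root>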
The xmlns:xlink attribute on the root element declares that the prefix
"xlink" maps to the given URL. The elem element therefore has one
attribute named href that comes from the XLink namespace.
Namespace-aware methods expect (URI, name) tuples instead of just
element and attribute names; instead of "xlink:href", they would
receive ('http://www.w3.org/1999/xlink', 'href').
Note that the actual value of the prefix is immaterial, and software
shouldn't make assumptions about it. The XML document would have
exactly the same meaning if the root element said
"xmlns:pref1="http://..."" and the attribute name was given as
"pref1:href".
If namespace processing is turned on, you would have to write
startElementNS() and endElementNS() methods that looked like this:
def startElementNS(self, (uri, localname), qname, attrs):
    ...
def endElementNS(self, (uri, localname), qname):
    ...
The first argument is a 2-tuple containing the URI and the name of the
element within that namespace. qname is a string containing the
original qualified name of the element, such as "xlink:a", and attrs
is a dictionary of attributes. The keys of this dictionary will be
(URI, attribute_name) pairs. If no namespace is specified for an
element or attribute, the URI will be given as None.
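Here is a sketch of a namespace-aware handler, written for Python 3
and the standard library's xml.sax module. Python 3 no longer allows
a tuple in a parameter list, so the (URI, name) pair is unpacked
inside the method instead. NSRecorder and the sample document are
hypothetical; the prefix is deliberately "pref1" to show that only the
URI matters. Be aware that the standard library's expat driver passes
None for qname, so don't depend on that argument.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces

XLINK_NS = "http://www.w3.org/1999/xlink"


class NSRecorder(ContentHandler):
    """Hypothetical handler that records namespace-aware names."""

    def __init__(self):
        super().__init__()
        self.elements = []
        self.attributes = []

    def startElementNS(self, name, qname, attrs):
        uri, localname = name  # name is a (URI, localname) 2-tuple
        self.elements.append((uri, localname))
        # Attribute names are (URI, localname) pairs as well.
        for attr_name in attrs.getNames():
            self.attributes.append((attr_name, attrs.getValue(attr_name)))

    def endElementNS(self, name, qname):
        pass


# Hypothetical document; the prefix is deliberately not "xlink".
DOC = ('<root xmlns:pref1="%s">'
       '<elem pref1:href="target.xml"/>'
       '</root>' % XLINK_NS).encode("ascii")

parser = xml.sax.make_parser()
parser.setFeature(feature_namespaces, 1)
recorder = NSRecorder()
parser.setContentHandler(recorder)
parser.parse(io.BytesIO(DOC))

print(recorder.elements)
print(recorder.attributes)
```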
6 DOM: The Document Object Model
With SAX you write a class which then gets the entire document poured
through it as a sequence of method calls. An alternative approach is
that taken by the Document Object Model, or DOM, which turns an XML
document into a tree that's fully resident in memory.
A top-level Document instance is the root of the tree, and has a
single child which is the top-level Element instance; this Element has
child nodes representing the content and any sub-elements, which may
in turn have further children and so forth. There are different
classes for everything that can be found in an XML document, so in
addition to the Element class, there are also classes such as Text,
Comment, CDATASection, EntityReference, and so on. Nodes have methods
for accessing the parent and child nodes, accessing element and
attribute values, inserting and deleting nodes, and converting the
tree back into XML.
The DOM is often useful for modifying XML documents, because you can
create a DOM tree, modify it by adding new nodes and moving subtrees
around, and then produce a new XML document as output. However,
while the DOM doesn't require that the entire tree be resident
in memory at one time, the Python DOM implementation currently keeps
the whole tree in RAM. This means you may not have enough memory to
process very large documents as a DOM tree. A SAX handler, on the
other hand, can potentially churn through amounts of data far larger
than the available RAM.
This HOWTO can't be a complete introduction to the Document Object
Model, because there are lots of interfaces and lots of methods.
Luckily, the DOM Recommendation is quite readable, so I'd recommend
that you read it to get a complete picture of the available
interfaces. This section will only be a partial overview.
See Also:
Document Object Model (DOM) Level 1
The first version of the DOM endorsed by the W3C. Unlike most
standards, this one is actually pretty readable, particularly
if you're only interested in the Core XML interfaces.
Document Object Model (DOM) Technical Reports
Level 2 of the DOM has been defined, adding more specialized
features such as support for XML namespaces, events, and
ranges. DOM Level 3 is still being worked on, and will add yet
more features. This overview provides a concise summary of the
current status of each specification, and links to the latest
version of each.
6.1 Getting A DOM Tree
The easiest way to get a DOM tree is to have it built for you. PyXML
offers two alternative implementations of the DOM, xml.dom.minidom and
4DOM. xml.dom.minidom is included in Python 2. It is a minimal
implementation, which means it does not provide all interfaces and
operations required by the DOM standard. 4DOM, part of the 4Suite set
of XML tools (http://www.4suite.org), is a complete implementation of
DOM Level 2 Core, so we will use that in the examples.
The xml.dom.ext.reader package contains a number of classes that build
a DOM tree from various input sources. One of the modules in the
xml.dom.ext.reader package is named Sax2, and contains a Reader class
that builds a DOM tree from a series of SAX2 events. Reader instances
provide a fromStream() method that constructs a DOM tree from an input
stream; the input can be a file-like object or a string. In the latter
case, it will be assumed to be a URL and will be opened with the urllib2
module. The advantage of using urllib2 over urllib is that HTTP errors
will be reported as exceptions.
import sys
from xml.dom.ext.reader import Sax2
# create Reader object
reader = Sax2.Reader()
# parse the document
doc = reader.fromStream(sys.stdin)
fromStream() returns the root of a DOM tree constructed from the input
XML document.
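If you don't have 4DOM installed, xml.dom.minidom can build a tree
too. The following sketch uses minidom from the standard library; it
is not the Sax2 Reader API, and the io.BytesIO object merely stands in
for sys.stdin or any other file-like object. Note one difference:
minidom.parse() treats a string argument as a filename, not a URL,
while parseString() takes the XML text itself.

```python
import io
from xml.dom import minidom

# Stand-in for sys.stdin or any other file-like object.
stream = io.BytesIO(b"<xbel><desc>No description</desc></xbel>")

doc = minidom.parse(stream)            # from a file-like object
doc2 = minidom.parseString("<xbel/>")  # from a string holding XML text

# Both calls return the Document node at the root of the tree.
print(doc.documentElement.tagName)
```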
6.2 Printing The Tree
We'll use a single example document throughout this section. Here's
the sample:
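<xbel>
  <?processing instruction?>
  <desc>No description</desc>
  <folder>
    <title>XML bookmarks</title>
    <bookmark>
      <title>SIG for XML Processing in Python</title>
    </bookmark>
  </folder>
</xbel>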
Converted to a DOM tree, this document could produce the following
tree.
Element xbel None
   Text #text ' \012 '
   ProcessingInstruction processing 'instruction'
   Text #text '\012 '
   Element desc None
      Text #text 'No description'
   Text #text '\012 '
   Element folder None
      Text #text '\012 '
      Element title None
         Text #text 'XML bookmarks'
      Text #text '\012 '
      Element bookmark None
         Text #text '\012 '
         Element title None
            Text #text 'SIG for XML Processing in Python'
         Text #text '\012 '
      Text #text '\012 '
   Text #text '\012'
This isn't the only possible tree, because different parsers may
differ in how they generate Text nodes; any of the Text nodes in the
above tree might be split into multiple nodes.
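A listing like the one above is easy to produce yourself. Here is a
sketch that uses xml.dom.minidom from the standard library; dump() is
a hypothetical helper, and the embedded document is a shortened
stand-in for the sample. Each line shows the node's class name,
nodeName, and nodeValue, indented by depth.

```python
from xml.dom import minidom

DOC = """<xbel>
  <?processing instruction?>
  <desc>No description</desc>
  <folder>
    <title>XML bookmarks</title>
  </folder>
</xbel>"""


def dump(node, depth=0, out=None):
    """Append one 'Class nodeName nodeValue' line per node, indented by depth."""
    if out is None:
        out = []
    out.append("%s%s %s %r" % ("  " * depth, node.__class__.__name__,
                               node.nodeName, node.nodeValue))
    for child in node.childNodes:
        dump(child, depth + 1, out)
    return out


lines = dump(minidom.parseString(DOC).documentElement)
print("\n".join(lines))
```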
A DOM tree can be converted back to XML by using the Print(doc,
stream) or PrettyPrint(doc, stream) functions in the xml.dom.ext
module. If stream isn't provided, the resulting XML will be printed to
standard output. Print() will simply render the DOM tree without any
changes, while PrettyPrint() will add or remove whitespace in order to
nicely indent the resulting XML.
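The minidom implementation has rough counterparts to these two
functions in the form of the toxml() and toprettyxml() node methods.
This sketch shows the difference; it uses minidom only, and the exact
layout produced by toprettyxml() can vary between Python versions.

```python
from xml.dom import minidom

doc = minidom.parseString("<folder><title>XML bookmarks</title></folder>")

# toxml() renders the tree as it stands, adding no whitespace.
plain = doc.documentElement.toxml()

# toprettyxml() adds newlines and indentation.
pretty = doc.documentElement.toprettyxml(indent="  ")

print(plain)
print(pretty)
```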
6.3 Manipulating the Tree
We'll start by considering the basic Node class. All the other DOM
nodes--Document, Element, Text, and so forth--are subclasses of Node.
It's possible to perform many tasks using just the interface provided
by Node.
First, there are the attributes provided by all Node instances:
Attribute         Meaning
nodeType          Integer constant giving the type of this node:
                  ELEMENT_NODE, TEXT_NODE, etc.
nodeName          Name of this node. For some types of node, such as
                  Elements, the name is the element name; for others,
                  such as Text, the name is a constant value such as
                  "#text", which isn't very useful.
nodeValue         Value of this node. For some types of node, such as
                  Text nodes, the value is a string containing a chunk
                  of textual data; for others, such as Element, the
                  value is just None.
parentNode        Parent of this node, or None if this node is the root
                  of a tree (usually meaning that it's a Document node).
childNodes        A possibly empty list containing the children of this
                  node.
firstChild        First child of this node, or None if it has no
                  children.
lastChild         Last child of this node, or None if it has no
                  children.
previousSibling   Preceding child of this node's parent, or None if this
                  node has no parent or if the parent has no preceding
                  children.
nextSibling       Following child of this node's parent, or None if this
                  node has no parent or if the parent has no following
                  children.
ownerDocument     Owning document of this node.
attributes        A NamedNodeMap instance that behaves mostly like a
                  dictionary, and maps attribute names to Attribute
                  instances.
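To make the attributes listed above concrete, here is a short sketch
that inspects a tiny tree. It uses xml.dom.minidom from the standard
library rather than 4DOM, and the document is invented for the
example.

```python
from xml.dom import minidom, Node

doc = minidom.parseString(
    '<folder id="f1"><title>XML bookmarks</title><desc>None</desc></folder>')
folder = doc.documentElement
title = folder.firstChild
text = title.firstChild

# Element nodes: nodeName is the tag name, nodeValue is None.
print(folder.nodeType == Node.ELEMENT_NODE, folder.nodeName, folder.nodeValue)
# Text nodes: nodeName is the constant "#text", nodeValue is the string.
print(text.nodeType == Node.TEXT_NODE, text.nodeName, repr(text.nodeValue))
# Navigation between relatives.
print(title.parentNode is folder, title.previousSibling,
      title.nextSibling.nodeName)
print(folder.lastChild.nodeName, len(folder.childNodes))
print(text.ownerDocument is doc)
# The attributes map can be indexed by attribute name.
print(folder.attributes["id"].value)
```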
Next, there are the methods. If a node is already a child of node 1
and is added as a child of node 2, it will automatically be removed
from node 1; nodes always have exactly zero or one parents.
Method                             Effect
appendChild(newChild)              Add newChild as a child of this node,
                                   adding it to the end of the list of
                                   children.
removeChild(oldChild)              Remove oldChild; its parentNode
                                   attribute will now return None.
replaceChild(newChild, oldChild)   Replace the child oldChild with
                                   newChild. oldChild must already be a
                                   child of the node.
insertBefore(newChild, refChild)   Add newChild as a child of this node,
                                   adding it before the node refChild.
                                   refChild must already be a child of
                                   the node.
hasChildNodes()                    Returns true if this node has any
                                   children.
cloneNode(deep)                    Returns a copy of this node. If deep
                                   is false, the copy will have no
                                   children. If it's true, then all of
                                   the children will also be copied and
                                   added as children to the returned
                                   copy.
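The following sketch exercises several of these methods, including
the rule that a node is removed from its old parent when it is added
to a new one. It uses xml.dom.minidom from the standard library, and
the element names are invented.

```python
from xml.dom import minidom

doc = minidom.parseString("<root><a><item/></a><b/></root>")
root = doc.documentElement
a, b = root.childNodes
item = a.firstChild

# Appending a node that already has a parent moves it.
b.appendChild(item)
moved = (item.parentNode is b) and not a.hasChildNodes()

# A deep clone copies the subtree; a shallow one copies only the node itself.
deep = b.cloneNode(True)
shallow = b.cloneNode(False)

# insertBefore() and removeChild() operate on existing children.
root.insertBefore(deep, a)
removed = root.removeChild(a)

print(root.toxml())
```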
Element nodes and the Document node also have a useful method,
getElementsByTagName(tagName), that returns a list of all elements
with the given name. For example, all the "chapter" elements can be
returned by document.getElementsByTagName('chapter').
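A quick sketch, again using xml.dom.minidom and an invented document,
shows that the search is recursive and that calling the method on an
Element restricts it to that element's descendants:

```python
from xml.dom import minidom

doc = minidom.parseString(
    "<book><chapter n='1'/><part><chapter n='2'/></part></book>")

# Called on the Document, the search covers the whole tree.
chapters = doc.getElementsByTagName("chapter")
numbers = [c.getAttribute("n") for c in chapters]

# Called on an Element, the search covers only that element's descendants.
part = doc.getElementsByTagName("part")[0]
in_part = [c.getAttribute("n") for c in part.getElementsByTagName("chapter")]

print(numbers, in_part)
```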
6.4 Creating New Nodes
The base of the entire tree is the Document node. Its documentElement
attribute contains the Element node for the root element. The Document
node may have additional children, such as ProcessingInstruction
nodes, but the list of children can include at most one Element node.
When building a DOM tree from scratch, you'll need to construct new
nodes of various types such as Element and Text. The Document node has
a bunch of create*() methods such as createElement() and
createTextNode().
For example, the following code adds a new child element named
"chapter" to the root element.
new = document.createElement('chapter')
new.setAttribute('number', '5')
document.documentElement.appendChild(new)
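The fragment above assumes a document already exists. To build one
entirely from scratch, you also need an empty Document to start from.
This sketch obtains one from xml.dom.minidom's DOM implementation
object; the root element name "book" and the text content are made up
for the example.

```python
from xml.dom import minidom

# Create an empty document whose root element is <book>.
document = minidom.getDOMImplementation().createDocument(None, "book", None)

# New nodes are created by the Document, then attached to the tree.
new = document.createElement("chapter")
new.setAttribute("number", "5")
new.appendChild(document.createTextNode("Chapter text"))
document.documentElement.appendChild(new)

print(document.documentElement.toxml())
```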
6.5 Walking Over The Entire Tree
Once you have a tree, another common task is to traverse it. Document
instances have a method called createTreeWalker(root, whatToShow,
filter, entityRefExpansion) that returns an instance of the TreeWalker
class.
Once you have a TreeWalker instance, it allows traversing through the
subtree rooted at the root node. The currentNode attribute contains
the current node that's been reached in this traversal, and can be
advanced forward or backward by calling the nextNode() and
previousNode() methods. There are also methods named parentNode(),
firstChild(), lastChild(), nextSibling(), and previousSibling() that
return the appropriate value for the current node.
whatToShow is a bitmask with bits set for each type of node that you
want to see in the traversal. Constants are available as attributes on
the NodeFilter class. 0 filters out all nodes, NodeFilter.SHOW_ALL
traverses every node, and constants such as SHOW_ELEMENT and SHOW_TEXT
select individual types of node.
filter is a function that will be passed every traversed node, and can
return NodeFilter.FILTER_ACCEPT or NodeFilter.FILTER_REJECT to accept
or reject the node. filter can be passed as None in order to accept
all nodes.
Here's an example that traverses the entire tree and prints out every
element.
from xml.dom.NodeFilter import NodeFilter

walker = doc.createTreeWalker(doc.documentElement,
                              NodeFilter.SHOW_ELEMENT, None, 0)
while 1:
    print walker.currentNode.tagName
    next = walker.nextNode()
    if next is None: break
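xml.dom.minidom doesn't provide createTreeWalker(), so if that is the
implementation you have, a small generator can stand in for the
simple forward traversal shown above. This is only a sketch of the
same idea, not the TreeWalker interface: walk() and its accept
argument are hypothetical names, and the children of a node that
isn't accepted are still visited.

```python
from xml.dom import minidom, Node


def walk(root, accept=lambda node: True):
    """Yield root and its descendants in document order (pre-order)."""
    if accept(root):
        yield root
    for child in root.childNodes:
        yield from walk(child, accept)


doc = minidom.parseString(
    "<xbel><desc>No description</desc>"
    "<folder><title>XML</title></folder></xbel>")

# Keep only Element nodes, much as SHOW_ELEMENT would.
names = [node.tagName
         for node in walk(doc.documentElement,
                          lambda n: n.nodeType == Node.ELEMENT_NODE)]
print(names)
```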
7 XPath and XPointer
XPath is a relatively simple language for writing expressions that
select a subset of the nodes in a DOM tree. Here are some example
XPath expressions, and what nodes they match:
Expression         Meaning
child::para        Selects all children of the context node that are
                   para elements.
child::para[5]     Selects the fifth para element child of the context
                   node.
descendant::para   Selects all descendants of the context node that are
                   para elements.
ancestor::*        Selects all ancestors of the context node.
Consult the XPath Recommendation for the full syntax and grammar.
The xml.xpath package contains a parser and evaluator for XPath
expressions. The Evaluate(expr, contextNode) function parses an
expression and evaluates it with respect to the given Element context
node. For example:
from xml import xpath
nodes = xpath.Evaluate('quotation/note', doc.documentElement)
If doc is an appropriate DOM tree, then this will return a list
containing the subset of nodes denoted by the XPath expression.
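If the xml.xpath package isn't available, you can still experiment
with simple path expressions. The sketch below uses a different
library, xml.etree.ElementTree from the standard library, which
supports only a limited subset of XPath and works on its own element
objects rather than DOM nodes. The document is invented for the
example.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<book>"
    "<quotation><note>first</note><p>text</p></quotation>"
    "<quotation><note>second</note></quotation>"
    "<note>not inside a quotation</note>"
    "</book>")

# Select note elements that are children of quotation children of root.
notes = root.findall("quotation/note")
texts = [n.text for n in notes]
print(texts)
```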
See Also:
XML Path Language (XPath), Version 1.0
The full specification for XPath.
8 Marshalling Into XML
The xml.marshal package contains code for marshalling Python data
types and objects into XML. The xml.marshal.generic module uses a
simple DTD of its own, and provides Marshaller and Unmarshaller
classes that can be subclassed to marshal objects using a different
DTD. As an example, xml.marshal.wddx marshals Python objects into the
WDDX DTD.
The interface is the same as the standard Python marshal module:
dump(value, file) and dumps(value) convert value into XML and either
write it to the given file or return it as a string, while load(file)
and loads(string) perform the reverse conversion. For example:
>>> from xml.marshal import generic
>>> generic.dumps( (1, 2.0, 'name', [2,3,5,7]) )
"""
1
2.0
name
2
3
5
7
"""
>>>
(The output has been pretty-printed for clarity.)
Note that, at least in the generic module, strings are simply
incorporated in the XML output and therefore can't contain control
characters that are illegal in XML. If you need to marshal such
strings, you'll have to encode them using the binascii module before
calling the dump() function.
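One way to do that encoding is sketched below, written for Python 3.
It assumes base64 is an acceptable encoding for your data and shows
only the binascii steps; the marshalling call itself is left out, and
you must remember to reverse the encoding after unmarshalling.

```python
import binascii

raw = "bell\x07 and null\x00 characters"

# Encode to bytes, then to base64 text that is safe to embed in XML.
encoded = binascii.b2a_base64(raw.encode("utf-8")).decode("ascii").strip()

# ... marshal `encoded` instead of `raw` ...

# After unmarshalling, reverse the two steps.
decoded = binascii.a2b_base64(encoded).decode("utf-8")
print(decoded == raw)
```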
9 Acknowledgements
The author would like to thank the following people for offering
suggestions, corrections and assistance with various drafts of this
article: Fred L. Drake, Jr., Martin von Löwis, Uche Ogbuji, Rich Salz.
About this document ...
Python/XML HOWTO
This document was generated using the LaTeX2HTML translator.
LaTeX2HTML is Copyright © 1993, 1994, 1995, 1996, 1997, Nikos Drakos,
Computer Based Learning Unit, University of Leeds, and Copyright ©
1997, 1998, Ross Moore, Mathematics Department, Macquarie University,
Sydney.
The application of LaTeX2HTML to the Python documentation has been
heavily tailored by Fred L. Drake, Jr. Original navigation icons were
contributed by Christopher Petrilli.
Release 0.7.1.