Parsing XML
-
class jxmlease.Parser(**kwargs)
Creates Python data structures from raw XML.
This class creates a callable object used to parse XML into Python data structures. You can provide optional parameters when you create the object. These parameters modify the default behavior of the parser. When you invoke the callable object to parse a document, you can supply additional parameters to override the values specified when the Parser object was created.

General usage is:

    >>> myparser = Parser()
    >>> root = myparser("<a>foo</a>")
Calling a Parser object returns an XMLDictNode containing the parsed XML tree.

In this example, root is an XMLDictNode which contains a representation of the parsed XML:

    >>> isinstance(root, XMLDictNode)
    True
    >>> root.prettyprint()
    {u'a': u'foo'}
    >>> print(root.emit_xml())
    <?xml version="1.0" encoding="utf-8"?>
    <a>foo</a>
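The "defaults at creation, overrides at call time" behavior described above follows a common pattern for callable objects. The sketch below illustrates that pattern only; ConfigurableCallable is a hypothetical stand-in, not jxmlease's actual implementation, and it returns the merged options instead of parsing anything.

```python
class ConfigurableCallable(object):
    """Hypothetical stand-in: defaults set at creation, overridden per call."""

    def __init__(self, **kwargs):
        # Options chosen when the object is created become the defaults.
        self.defaults = dict(kwargs)

    def __call__(self, document, **kwargs):
        # Per-call arguments take precedence over the creation-time defaults.
        options = dict(self.defaults)
        options.update(kwargs)
        return document, options


stripper = ConfigurableCallable(strip_whitespace=True, xml_attribs=True)
_, opts_default = stripper("<a>foo</a>")
_, opts_override = stripper("<a>foo</a>", xml_attribs=False)
```

Note that the override applies to that one call only; the defaults stored on the object are left unchanged.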
If you only need to parse once, you can use the parse() function, a shortcut that creates a Parser instance and calls it in a single step. parse() accepts the same arguments as the Parser class.

For example:

    >>> root = jxmlease.parse('<a x="y"><b>1</b><b>2</b><b>3</b></a>')
    >>> root.prettyprint()
    {u'a': {u'b': [u'1', u'2', u'3']}}
It is possible to call a Parser object as a generator by specifying the generator parameter. The generator parameter contains a list of paths to match. If paths are provided in this parameter, the behavior of the parser changes: instead of returning the root node of a parsed XML hierarchy, the parser returns a generator object. Each iteration of the generator yields the next node that matches one of the provided paths.

Paths are provided in a format similar to XPath expressions. For example, /a/b will match node <b> in this XML:

    <a>
      <b/>
    </a>

If a path begins with a /, it must exactly match the full path to a node. If a path does not begin with a /, it must exactly match the "right side" of the path to a node. For example, consider this XML:

    <a>
      <b>
        <c/>
      </b>
    </a>

In this example, /a/b/c, c, b/c, and a/b/c all match the <c> node.

For each match, the generator returns a tuple of (path, match_string, xml_node), where path is the calculated absolute path to the matching node, match_string is the user-supplied match string that triggered the match, and xml_node is the object representing that node (an instance of an XMLNodeBase subclass).

For example:

    >>> xml = '<a x="y"><b>1</b><b>2</b><b>3</b></a>'
    >>> myparser = Parser(generator=["/a/b"])
    >>> for (path, match, value) in myparser(xml):
    ...     print("%s: %s" % (path, value))
    ...
    /a/b: 1
    /a/b: 2
    /a/b: 3
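The path-matching rules above can be sketched in plain Python. This is an illustration only, not jxmlease's implementation: path_matches and iter_matches are hypothetical helpers, the sketch uses the standard library's ElementTree and parses the whole document up front, and it assumes that relative paths match on whole path components and that each node is reported once even if several paths match.

```python
import xml.etree.ElementTree as ET


def path_matches(pattern, node_path):
    """Return True if pattern selects the node at absolute path node_path."""
    if pattern.startswith("/"):
        # Absolute pattern: must equal the node's full path.
        return pattern == node_path
    # Relative pattern: must equal the trailing components of the path.
    pattern_parts = pattern.split("/")
    node_parts = node_path.strip("/").split("/")
    return node_parts[-len(pattern_parts):] == pattern_parts


def iter_matches(xml_text, patterns):
    """Yield (path, match_string, element) for every matching element."""
    def walk(element, parent_path):
        path = parent_path + "/" + element.tag
        for pattern in patterns:
            if path_matches(pattern, path):
                yield (path, pattern, element)
                break  # assumption: report each node only once
        for child in element:
            yield from walk(child, path)

    return walk(ET.fromstring(xml_text), "")


xml = '<a x="y"><b>1</b><b>2</b><b>3</b></a>'
for path, match, element in iter_matches(xml, ["/a/b"]):
    print("%s: %s" % (path, element.text))
```

The tuple shape mirrors the (path, match_string, xml_node) tuples described above, with an ElementTree element standing in for the XMLNodeBase object.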
When calling the parser, you can specify all of these parameters. When creating a parser instance, you can specify all of these parameters except xml_input:

Parameters:
- xml_input (string or file-like object) – Contains the XML to parse.
- encoding (string or None) – The input's encoding. If not provided, this defaults to 'utf-8'.
- expat (an expat, or equivalent, parser class) – Used for parsing the XML input. If not provided, defaults to the expat parser in xml.parsers.
- process_namespaces (bool) – If True, namespaces in tags and attributes are converted to their full URL value. If False (the default), the namespaces in tags and attributes are left unchanged.
- namespace_separator (string) – If process_namespaces is True, this specifies the separator that expat should use between namespaces and identifiers in tags and attributes.
- xml_attribs (bool) – If True (the default), include XML attributes. If False, ignore them.
- strip_whitespace (bool) – If True (the default), strip whitespace at the start and end of CDATA. If False, keep all whitespace.
- namespaces (dict) – A remapping for namespaces. If supplied, identifiers with a namespace prefix will have their namespace prefix rewritten based on the dictionary. The code will look for namespaces[current_namespace]. If found, current_namespace will be replaced with the result of the lookup.
- strip_namespace (bool) – If True, the namespace prefix will be removed from all identifiers. If False (the default), the namespace prefix will be retained.
- cdata_separator (string) – When encountering "semi-structured" XML (where the XML has CDATA and tags intermixed at the same level), the cdata_separator will be placed between the different groups of CDATA. By default, the cdata_separator parameter is '', which results in the CDATA groups being concatenated without a separator.
- generator (list of strings) – A list of paths to match. If paths are provided here, the behavior of the parser is changed. Instead of returning the root node of a parsed XML hierarchy, the parser returns a generator object. Each iteration of the generator object yields the next node that matches one of the provided paths.
Returns: A callable instance of the Parser class.

Calling a Parser object returns an XMLDictNode containing the parsed XML tree.

Alternatively, if the generator parameter is specified, a generator object is returned.
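The effect of cdata_separator on semi-structured XML can be illustrated without jxmlease. The sketch below uses the standard library's xml.parsers.expat to collect the runs of character data that sit directly under the root element, then joins them with a separator. The helper name root_cdata_groups and the whitespace trimming are assumptions made for illustration; jxmlease's own grouping logic may differ in detail.

```python
import xml.parsers.expat


def root_cdata_groups(xml_text):
    """Collect the runs of character data directly under the root element."""
    groups = []   # completed runs of text
    buffer = []   # pieces of the run currently being read
    depth = 0     # current element nesting depth

    def flush():
        text = "".join(buffer).strip()
        if text:
            groups.append(text)
        del buffer[:]

    def start(name, attrs):
        nonlocal depth
        if depth == 1:
            flush()  # a child tag ends the current run of root-level text
        depth += 1

    def end(name):
        nonlocal depth
        depth -= 1
        if depth == 0:
            flush()  # closing the root ends the final run

    def chars(data):
        if depth == 1:
            buffer.append(data)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_text, True)
    return groups


groups = root_cdata_groups("<a>foo<b>x</b>bar</a>")
joined_default = "".join(groups)   # mimics the default separator of ''
joined_spaced = " ".join(groups)   # mimics a separator of ' '
```

With an empty separator the two groups run together, which is why a non-empty cdata_separator can be useful when the boundaries between text runs matter.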
__delattr__
x.__delattr__('name') <==> del x.name
-
__format__()
default object formatter
-
__getattribute__
x.__getattribute__('name') <==> x.name
-
__hash__
-
__reduce__()
helper for pickle
-
__reduce_ex__()
helper for pickle
-
__repr__
-
__setattr__
x.__setattr__('name', value) <==> x.name = value
-
__sizeof__() → int
size of object in memory, in bytes
-
__str__
-
jxmlease.parse(xml_input, **kwargs)
Create Python data structures from raw XML.

See the Parser class documentation.
-
class jxmlease.EtreeParser(**kwargs)
Creates Python data structures from an ElementTree object.
This class returns a callable object. You can provide parameters at the class creation time. These parameters modify the default parameters for the parser. When you call the callable object to parse a document, you can supply additional parameters to override the default values.
General usage is like this:
    >>> myparser = EtreeParser()
    >>> root = myparser(etree_root)
For detailed usage information, see the Parser class. Other than the differences noted below, the behavior of the two classes should be the same.

Namespace Identifiers:

In certain versions of ElementTree, the original namespace identifiers are not maintained. In these cases, the class will recreate namespace identifiers to represent the original namespaces. It will add appropriate xmlns attributes to maintain the original namespace mapping. However, the actual identifier will be lost. As best I can tell, this is a bug with ElementTree, rather than this code. To avoid this problem, use lxml.

Single-invocation Parsing:

If you only need to parse once, you can use the parse_etree() function, a shortcut that creates an EtreeParser instance and calls it in a single step. parse_etree() accepts the same arguments as the EtreeParser class.

Parameters: etree_root (ElementTree) – An ElementTree object representing the tree you wish to parse.

Also accepts most of the same arguments as the Parser class. However, it does not accept the xml_input, expat, or encoding parameters.
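The namespace-identifier caveat above can be observed with the standard library alone. In this sketch (Python 3; the URI http://example.com/ns and the prefix x are made up for illustration), the standard xml.etree.ElementTree stores tags in "{uri}local" form, so the original prefix is not available afterwards and a generated prefix appears when the tree is serialized.

```python
import xml.etree.ElementTree as ET

source = '<x:a xmlns:x="http://example.com/ns"><x:b>1</x:b></x:a>'
root = ET.fromstring(source)

# The tag is stored in "{uri}local" form; the original "x" prefix is not kept.
tag = root.tag

# On serialization, ElementTree invents its own prefix for the namespace URI.
serialized = ET.tostring(root, encoding="unicode")
print(tag)
print(serialized)
```

The namespace mapping itself survives (the URI is still attached to each tag), but the identifier chosen by the document's author does not.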
__delattr__
x.__delattr__('name') <==> del x.name
-
__format__()
default object formatter
-
__getattribute__
x.__getattribute__('name') <==> x.name
-
__hash__
-
__reduce__()
helper for pickle
-
__reduce_ex__()
helper for pickle
-
__repr__
-
__setattr__
x.__setattr__('name', value) <==> x.name = value
-
__sizeof__() → int
size of object in memory, in bytes
-
__str__
-
-
jxmlease.parse_etree(etree_root, **kwargs)
Create Python data structures from an ElementTree object.

See the EtreeParser class documentation.