Parsing XML

class jxmlease.Parser(**kwargs)[source]

Creates Python data structures from raw XML.

This class creates a callable object used to parse XML into Python data structures. You can provide optional parameters when you create the object; these parameters modify the default behavior of the parser. When you invoke the callable object to parse a document, you can supply additional parameters to override the values specified when the Parser object was created.
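The relationship between creation-time and call-time parameters can be sketched with a small stand-in class. SketchParser is a hypothetical name used only for illustration: it does not parse anything and is not jxmlease's implementation. It also assumes that a call-time override applies to that call only, which the text above does not state explicitly.

```python
class SketchParser:
    """Hypothetical stand-in that only shows how options are merged."""

    def __init__(self, **kwargs):
        # Options given at creation time become this instance's defaults.
        self._defaults = dict(kwargs)

    def __call__(self, xml_input, **kwargs):
        # Options given at call time take precedence over the defaults.
        # Assumption: the override applies to this call only.
        options = dict(self._defaults)
        options.update(kwargs)
        return xml_input, options


myparser = SketchParser(strip_whitespace=False)
_, default_options = myparser("<a> foo </a>")
_, override_options = myparser("<a> foo </a>", strip_whitespace=True)
print(default_options)
print(override_options)
```

The same object can be reused for many documents, and any single call can pass different keyword arguments without rebuilding it.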

General usage is:

>>> from jxmlease import Parser, XMLDictNode
>>> myparser = Parser()
>>> root = myparser("<a>foo</a>")

Calling a Parser object returns an XMLDictNode containing the parsed XML tree.

In this example, root is an XMLDictNode which contains a representation of the parsed XML:

>>> isinstance(root, XMLDictNode)
True
>>> root.prettyprint()
{u'a': u'foo'}
>>> print(root.emit_xml())
<?xml version="1.0" encoding="utf-8"?>
<a>foo</a>

If you will only use a parser once, you can call the parse() function, a shortcut that creates a Parser object and calls it in a single step. parse() takes the XML input followed by the same arguments that you can provide to the Parser class.

For example:

>>> import jxmlease
>>> root = jxmlease.parse('<a x="y"><b>1</b><b>2</b><b>3</b></a>')
>>> root.prettyprint()
{u'a': {u'b': [u'1', u'2', u'3']}}

You can use a Parser object as a generator by specifying the generator parameter, which contains a list of paths to match. If paths are provided in this parameter, the behavior of the parser changes: instead of returning the root node of a parsed XML hierarchy, the parser returns a generator object. Each iteration of the generator yields the next node that matches one of the provided paths.

Paths are provided in a format similar to XPath expressions. For example, /a/b will match node <b> in this XML:

<a>
    <b/>
</a>

If a path begins with a /, it must exactly match the full path to a node. If a path does not begin with a /, it must exactly match the “right side” of the path to a node. For example, consider this XML:

<a>
    <b>
        <c/>
    </b>
</a>

In this example, /a/b/c, c, b/c, and a/b/c all match the <c> node.
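The two matching rules above can be sketched as a small standalone function. path_matches is a hypothetical helper written for this illustration, not part of jxmlease, and it assumes that paths are plain /-separated element names.

```python
def path_matches(pattern, node_path):
    """Return True if `pattern` matches the absolute `node_path`.

    Hypothetical helper: a sketch of the documented rules only.
    """
    if pattern.startswith("/"):
        # Absolute pattern: must equal the full path to the node.
        return pattern == node_path
    # Relative pattern: must equal the trailing path segments, so the
    # match has to begin on a "/" boundary in the node's path.
    return node_path.endswith("/" + pattern)


for pattern in ["/a/b/c", "c", "b/c", "a/b/c", "/b/c", "b"]:
    print(pattern, path_matches(pattern, "/a/b/c"))
```

Under these rules, /b/c and b do not match the <c> node: the first is absolute but not the full path, and the second is not the right side of the path.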

For each match, the generator returns a tuple of (path, match_string, xml_node), where path is the calculated absolute path to the matching node, match_string is the user-supplied match string that triggered the match, and xml_node is the object representing that node (an instance of an XMLNodeBase subclass).

For example:

>>> from jxmlease import Parser
>>> xml = '<a x="y"><b>1</b><b>2</b><b>3</b></a>'
>>> myparser = Parser(generator=["/a/b"])
>>> for (path, match, value) in myparser(xml):
...     print("%s: %s" % (path, value))
...
/a/b: 1
/a/b: 2
/a/b: 3
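To illustrate the shape of the values the generator yields, the following sketch produces similar (path, match_string, node) tuples using only the standard library's xml.etree.ElementTree.iterparse. It is not how jxmlease is implemented: iter_matches and _matches are hypothetical helpers, the yielded node is an ElementTree element rather than an XMLNodeBase subclass, and reporting only the first matching pattern for each node is an assumption.

```python
import io
import xml.etree.ElementTree as ET


def _matches(pattern, path):
    # Absolute patterns must equal the full path; relative patterns
    # must equal its trailing segments.
    if pattern.startswith("/"):
        return pattern == path
    return path.endswith("/" + pattern)


def iter_matches(xml_text, patterns):
    """Yield (path, match_string, element) for each matching element."""
    stack = []  # tags of the currently open elements, root first
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
            continue
        # On "end" the element is complete, so its text is available.
        path = "/" + "/".join(stack)
        for pattern in patterns:
            if _matches(pattern, path):
                yield path, pattern, elem
                break  # assumption: report each node once
        stack.pop()


xml = '<a x="y"><b>1</b><b>2</b><b>3</b></a>'
for path, match, elem in iter_matches(xml, ["/a/b"]):
    print("%s: %s" % (path, elem.text))
```

Because matches are yielded as the document is read, a caller can process each node without first building the whole tree, which is the main reason to prefer the generator form for large inputs.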

When calling the parser, you can specify all of these parameters. When creating a Parser instance, you can specify all of these parameters except xml_input:

Parameters:
  • xml_input (string or file-like object) – Contains the XML to parse.
  • encoding (string or None) – The input’s encoding. If not provided, this defaults to ‘utf-8’.
  • expat (An expat, or equivalent, parser class) – Used for parsing the XML input. If not provided, defaults to the expat parser in xml.parsers.
  • process_namespaces (bool) – If True, namespaces in tags and attributes are converted to their full URL value. If False (the default), the namespaces in tags and attributes are left unchanged.
  • namespace_separator (string) – If process_namespaces is True, this specifies the separator that expat should use between namespaces and identifiers in tags and attributes.
  • xml_attribs (bool) – If True (the default), include XML attributes. If False, ignore them.
  • strip_whitespace (bool) – If True (the default), strip whitespace at the start and end of CDATA. If False, keep all whitespace.
  • namespaces (dict) – A remapping for namespaces. If supplied, identifiers with a namespace prefix will have their namespace prefix rewritten based on the dictionary. The code will look for namespaces[current_namespace]. If found, current_namespace will be replaced with the result of the lookup.
  • strip_namespace (bool) – If True, the namespace prefix will be removed from all identifiers. If False (the default), the namespace prefix will be retained.
  • cdata_separator (string) – When encountering “semi-structured” XML (where the XML has CDATA and tags intermixed at the same level), the cdata_separator will be placed between the different groups of CDATA. By default, the cdata_separator parameter is an empty string (''), which results in the CDATA groups being concatenated without a separator.
  • generator (list of strings) – A list of paths to match. If paths are provided here, the behavior of the parser changes: instead of returning the root node of a parsed XML hierarchy, the parser returns a generator object. Each iteration of the generator yields the next node that matches one of the provided paths.
Returns:

A callable instance of the Parser class.

Calling a Parser object returns an XMLDictNode containing the parsed XML tree.

Alternatively, if the generator parameter is specified, a generator object is returned.
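As a rough illustration of the process_namespaces, namespace_separator, and namespaces parameters in the list above, the sketch below drives the standard library's expat parser directly. collect_names is a hypothetical helper, and the remapping step is an assumed reading of the namespaces lookup described above, not jxmlease's actual code.

```python
from xml.parsers import expat


def collect_names(xml_text, namespace_separator=":", namespaces=None):
    """Return element names as expat reports them, optionally remapped."""
    namespaces = namespaces or {}
    names = []

    def start_element(name, attrs):
        # With namespace processing enabled, expat reports names as
        # "<namespace URI><separator><local name>". The URI may itself
        # contain the separator, so split on the last occurrence.
        uri, sep, local = name.rpartition(namespace_separator)
        if not sep:
            names.append(local)  # element is not in any namespace
            return
        # Assumed remapping rule: replace the namespace if it is a key.
        uri = namespaces.get(uri, uri)
        names.append(uri + sep + local)

    # expat requires the separator to be a single character.
    parser = expat.ParserCreate(namespace_separator=namespace_separator)
    parser.StartElementHandler = start_element
    parser.Parse(xml_text, True)
    return names


doc = '<p:a xmlns:p="http://example.com/ns"><p:b/><c/></p:a>'
print(collect_names(doc))
print(collect_names(doc, namespaces={"http://example.com/ns": "ex"}))
```

The first call shows the full-URL form that namespace processing produces; the second shows how a remapping dictionary can shorten those URLs to a prefix of your choosing.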

__call__(xml_input, **kwargs)[source]

See the class documentation.

jxmlease.parse(xml_input, **kwargs)[source]

Create Python data structures from raw XML.

See the Parser class documentation.

class jxmlease.EtreeParser(**kwargs)[source]

Creates Python data structures from an ElementTree object.

This class creates a callable object. You can provide parameters when you create the object; these parameters modify the default behavior of the parser. When you call the object to parse a document, you can supply additional parameters to override those defaults.

General usage is like this:

>>> from jxmlease import EtreeParser
>>> myparser = EtreeParser()
>>> root = myparser(etree_root)

For detailed usage information, please see the Parser class. Other than the differences noted below, the behavior of the two classes should be the same.

Namespace Identifiers:

In certain versions of ElementTree, the original namespace identifiers are not maintained. In these cases, the class will recreate namespace identifiers to represent the original namespaces. It will add appropriate xmlns attributes to maintain the original namespace mapping. However, the original identifiers will be lost. This appears to be a bug in ElementTree rather than in this code. To avoid this problem, use lxml.
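The loss of namespace identifiers can be observed with the standard library alone. The following sketch assumes a recent Python 3 and its bundled xml.etree.ElementTree; the document and the p prefix are made up for illustration.

```python
import xml.etree.ElementTree as ET

source = '<p:a xmlns:p="http://example.com/ns"><p:b>1</p:b></p:a>'
root = ET.fromstring(source)

# The parsed tag keeps the namespace URI but not the original "p" prefix.
print(root.tag)

# On output, ElementTree invents a prefix because the original is gone.
serialized = ET.tostring(root, encoding="unicode")
print(serialized)
```

The namespace URI survives the round trip, so the document keeps its meaning, but any code that depends on the literal prefix will see a generated one instead.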

Single-invocation Parsing:

If you will only use a parser once, you can call the parse_etree() function, a shortcut that creates an EtreeParser object and calls it in a single step. parse_etree() takes the ElementTree object followed by the same arguments that you can provide to the EtreeParser class.

Parameters:
  • etree_root (ElementTree) – An ElementTree object representing the tree you wish to parse.

Also accepts most of the same arguments as the Parser class. However, it does not accept the xml_input, expat, or encoding parameters.

__call__(etree_root, **kwargs)[source]

See the class documentation.

jxmlease.parse_etree(etree_root, **kwargs)[source]

Create Python data structures from an ElementTree object.

See the EtreeParser class documentation.