How to find XML Elements via XPath in Python in a namespace-agnostic way?


Question

Since I've run into this annoying issue for the second time, I thought asking would help.

Sometimes I have to get Elements from XML documents, but the ways to do this are awkward.

I'd like to know of a Python library that does what I want, an elegant way to formulate my XPaths, a way to register the namespaces in prefixes automatically, or a hidden preference in the builtin XML implementations or in lxml to strip namespaces completely. Clarification follows unless you already know what I want :)

Example document:

<root xmlns="http://really-long-namespace.uri"
  xmlns:other="http://with-ambivalent.end/#">
    <other:elem/>
</root>

What I can do

The ElementTree API is the only builtin one (that I know of) that provides XPath queries. But it requires me to use "UNames". These look like this: /{http://really-long-namespace.uri}root/{http://with-ambivalent.end/#}elem

As you can see, these are quite verbose. I can shorten them by doing the following:

default_ns = "http://really-long-namespace.uri"
other_ns   = "http://with-ambivalent.end/#"
doc.find("/{{{0}}}root/{{{1}}}elem".format(default_ns, other_ns))

But this is both {{{ugly}}} and fragile, since http…end/# ≠ http…end# ≠ http…end/ ≠ http…end, and who am I to know which variant will be used?
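A slightly less noisy way to build the same UNames, for what it's worth: the stdlib `xml.etree.ElementTree.QName` class produces the `{uri}tag` form without any brace escaping. The sketch below assumes the example document from above; the constant names are only illustrative. It does nothing about the fragility, since the URI still has to be spelled exactly as in the document.

```python
import xml.etree.ElementTree as ET

DEFAULT_NS = "http://really-long-namespace.uri"
OTHER_NS = "http://with-ambivalent.end/#"

DOC = '''<root xmlns="http://really-long-namespace.uri"
  xmlns:other="http://with-ambivalent.end/#">
    <other:elem/>
</root>'''

root = ET.fromstring(DOC)

# QName(uri, tag).text yields the "{uri}tag" form that find() expects
elem_name = ET.QName(OTHER_NS, "elem").text
elem = root.find(elem_name)

print(root.tag)
print(elem.tag)
```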

Also, lxml supports namespace prefixes, but it neither uses the ones in the document nor provides an automated way to deal with default namespaces. I would still have to get one element of each namespace to retrieve it from the document. Namespace attributes are not preserved, so there is no way of automatically retrieving them from these, either.
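One possible workaround with the builtin parser, sketched under assumptions: `xml.etree.ElementTree.iterparse` can report `start-ns` events, which carry each prefix declaration as a `(prefix, uri)` pair while the document is being read. The helper name `parse_with_nsmap` is made up for this example, and it keeps only the first declaration of each prefix, which is a simplification for documents that redeclare prefixes further down.

```python
import io
import xml.etree.ElementTree as ET

def parse_with_nsmap(source):
    """Parse XML and collect the prefix -> URI declarations found in it."""
    nsmap = {}
    root = None
    for event, payload in ET.iterparse(source, events=("start-ns", "start")):
        if event == "start-ns":
            prefix, uri = payload  # the default namespace has prefix ''
            nsmap.setdefault(prefix, uri)  # keep the first declaration only
        elif root is None:
            root = payload  # the first "start" event is the root element
    return root, nsmap

DOC = b'''<root xmlns="http://really-long-namespace.uri"
  xmlns:other="http://with-ambivalent.end/#">
    <other:elem/>
</root>'''

root, nsmap = parse_with_nsmap(io.BytesIO(DOC))

# the document's own prefix can now be used in the path
elem = root.find("other:elem", nsmap)
print(nsmap)
print(elem.tag)
```

The collected mapping is an ordinary dict, so it can be passed as the `namespaces` argument of `find()` and `findall()`.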

There is a namespace-agnostic way of writing XPath queries, too, but it is both verbose/ugly and unavailable in the builtin implementation: /*[local-name() = 'root']/*[local-name() = 'elem']
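A note for newer interpreters: since Python 3.8 the builtin ElementTree path syntax accepts a `{*}` namespace wildcard, which gives a shorter namespace-agnostic match than `local-name()`. The sketch below assumes Python 3.8 or later and the example document from above. Paths are relative to the element they are called on, so the root itself is not part of the expression.

```python
import xml.etree.ElementTree as ET

DOC = '''<root xmlns="http://really-long-namespace.uri"
  xmlns:other="http://with-ambivalent.end/#">
    <other:elem/>
</root>'''

root = ET.fromstring(DOC)

# "{*}elem" matches children named "elem" in any namespace, or in none
matches = root.findall("{*}elem")
tags = [el.tag for el in matches]
print(tags)
```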

I want to find a library, option or generic XPath-morphing function to achieve the above examples by typing little more than the following…

  1. Unnamespaced: /root/elem
  2. Namespace-prefixes from document: /root/other:elem

…plus maybe some statements that I indeed want to use the document's prefixes or strip the namespaces.

Further clarification: although my current use case is as simple as that, I will have to use more complex ones in the future.

Thanks for reading!

The user samplebias directed my attention to py-dom-xpath; exactly what I was looking for. My actual code now looks like this:

import xml.dom.minidom
import xpath  # the py-dom-xpath package

#parse the document into a DOM tree
rdf_tree = xml.dom.minidom.parse("install.rdf")
#read the default namespace and prefix from the root node
context = xpath.XPathContext(rdf_tree)

name    = context.findvalue("//em:id", rdf_tree)
version = context.findvalue("//em:version", rdf_tree)

#<Description/> inherits the default RDF namespace
resource_nodes = context.find("//Description/following-sibling::*", rdf_tree)

Consistent with the document, simple, namespace-aware; perfect.

Recommended answer

The *[local-name() = "elem"] syntax should work, but to make it easier you can create a function that simplifies construction of partial or full "wildcard namespace" XPath expressions.

I'm using python-lxml 2.2.4 on Ubuntu 10.04 and the script below works for me. You'll need to customize the behavior depending on how you want to specify the default namespaces for each element, and handle any other XPath syntax you want to fold into the expression:

import lxml.etree

def xpath_ns(tree, expr):
    "Parse a simple expression and prepend namespace wildcards where unspecified."
    qual = lambda n: n if not n or ':' in n else '*[local-name() = "%s"]' % n
    expr = '/'.join(qual(n) for n in expr.split('/'))
    nsmap = dict((k, v) for k, v in tree.nsmap.items() if k)
    return tree.xpath(expr, namespaces=nsmap)

doc = '''<root xmlns="http://really-long-namespace.uri"
    xmlns:other="http://with-ambivalent.end/#">
    <other:elem/>
</root>'''

tree = lxml.etree.fromstring(doc)
print(xpath_ns(tree, '/root'))
print(xpath_ns(tree, '/root/elem'))
print(xpath_ns(tree, '/root/other:elem'))

Output:

[<Element {http://really-long-namespace.uri}root at 23099f0>]
[<Element {http://with-ambivalent.end/#}elem at 2309a48>]
[<Element {http://with-ambivalent.end/#}elem at 2309a48>]
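To see which expression actually reaches lxml, the rewriting step can be pulled out into a standalone function. This is only a sketch: the name `wildcard_ns` and the regular expression are assumptions made for illustration, and it handles simple slash-separated paths only. Unlike the lambda above, it leaves steps such as `*`, `.`, `..`, `@attr` and anything with a predicate untouched instead of wrapping them.

```python
import re

# A plain, unprefixed element name (a rough approximation of an XML NCName)
_PLAIN_NAME = re.compile(r"^[A-Za-z_][\w.-]*$")

def wildcard_ns(expr):
    """Rewrite unprefixed name steps of a simple path to match any namespace."""
    def qualify(step):
        if _PLAIN_NAME.match(step):
            return '*[local-name() = "%s"]' % step
        # empty, prefixed, wildcard, attribute and predicate steps stay as-is
        return step
    return "/".join(qualify(step) for step in expr.split("/"))

print(wildcard_ns("/root/elem"))
print(wildcard_ns("/root/other:elem"))
print(wildcard_ns("//other:elem/*"))
```

The returned string can then be passed to `tree.xpath()` together with the prefix map, as `xpath_ns` does.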

Update: If you find out you do need to parse XPaths, you can check out projects like py-dom-xpath, which is a pure Python implementation of (most of) XPath 1.0. At the very least, that will give you some idea of the complexity of parsing XPath.
