Parsing XML with undeclared prefixes in Python

Question

I am trying to parse XML data with Python that uses prefixes, but not every file has the declaration of the prefix. Example XML:

<?xml version="1.0" encoding="UTF-8"?>
<item subtype="bla">
    <thing>Word</thing>
    <abc:thing2>Another Word</abc:thing2>
</item>

I have been using xml.etree.ElementTree to parse these files, but whenever the prefix is not properly declared, ElementTree throws a parse error. (unbound prefix, right at the start of <abc:thing2>) Searching for this error leads me to solutions that suggest I fix the namespace declaration. However, I do not control the XML that I need to work with, so modifying the input files is not a viable option.

Searching for namespace parsing in general mostly turns up questions about searching in a namespace-agnostic way, which is not what I need.

I am looking for some way to automatically parse these files, even if the namespace declaration is broken. I have thought about doing the following:

  • Tell ElementTree what namespaces to expect beforehand, because I do know which ones can occur. I found register_namespace, but that does not seem to work.
  • Have the full DTD read in before parsing, and see if that solves it. I could not find a way to do this with ElementTree.
  • Tell ElementTree not to bother about namespaces at all. That should not cause issues with my data, but I found no way to do this.
  • Use some other parsing library that can handle this issue, though I would prefer not to install extra libraries. I have difficulty telling from the documentation whether any of them would solve my issue.
  • Some other route that I am currently not seeing?
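
As an illustration of the third option using only the standard library, here is a minimal sketch. It assumes it is acceptable to keep prefixed names such as abc:thing2 as plain tag strings. The helper name parse_without_namespaces and the namespace URI are made up for this example. It also shows why register_namespace does not help: that function only controls which prefix is written when a tree is serialized, so the parse still fails.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat

RAW = b"""<?xml version="1.0" encoding="UTF-8"?>
<item subtype="bla">
    <thing>Word</thing>
    <abc:thing2>Another Word</abc:thing2>
</item>"""


def parse_without_namespaces(data):
    """Hypothetical helper: build an Element with namespace processing off."""
    builder = ET.TreeBuilder()
    # ParserCreate() without a namespace_separator does no prefix resolution,
    # so an undeclared prefix is just part of the element name.
    parser = expat.ParserCreate()
    parser.StartElementHandler = builder.start
    parser.EndElementHandler = builder.end
    parser.CharacterDataHandler = builder.data
    parser.Parse(data, True)  # True marks this as the final chunk
    return builder.close()


# register_namespace only affects serialization; the URI is a placeholder.
ET.register_namespace("abc", "http://example.com/abc")
try:
    ET.fromstring(RAW)
    strict_error = None
except ET.ParseError as exc:
    strict_error = exc  # the "unbound prefix" error is still raised

root = parse_without_namespaces(RAW)
tags = [child.tag for child in root]
print(tags)
```

Other well-formedness errors, such as an unclosed tag, still raise expat.ExpatError with this approach. One caveat: a tag like abc:thing2 cannot be passed to find() as is, because ElementTree's path syntax treats the colon as a namespace prefix, so iterate over the children and compare child.tag instead.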

UPDATE: After Har07 put me on the path of lxml, I tried to see if this would let me perform the different solutions I had thought of, and what the result would be:

  • Telling the parser what namespaces to expect beforehand: I still could not find any 'official' way to do this, but in my earlier searches I had found the suggestion to simply add the requisite declaration to the data programmatically. (That was for a different programming situation; unfortunately I can't find the link anymore.) It seemed terribly hacky to me, but I tried it anyway. It involves loading the data as a string, changing the enclosing element to have the right xmlns declarations, and then handing it off to lxml.etree's fromstring method. Unfortunately, that also requires removing every reference to the encoding declaration from the string. It works, though.
  • Reading in the DTD before parsing: this is possible with lxml (through attribute_defaults, dtd_validation, or load_dtd), but unfortunately it does not solve the namespace issue.
  • Telling lxml not to bother about namespaces: this is possible through the recover option. Unfortunately, that also ignores other ways in which the XML may be broken (see Har07's answer for details).
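
The first approach in this list (adding the missing declaration programmatically) can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name declare_missing_prefixes, the KNOWN_NAMESPACES map, and the placeholder URI are all made up here, and the regular expression is a crude way to find the root start tag that would be fooled by a comment containing tag-like text before the root. It uses the standard library's ElementTree and works on bytes, in which case the encoding declaration can stay in place.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical prefix-to-URI map; the URI is a placeholder, not real data.
KNOWN_NAMESPACES = {"abc": "http://example.com/abc"}


def declare_missing_prefixes(data, namespaces):
    """Insert xmlns:prefix declarations into the first start tag of data (bytes)."""
    # First '<' followed by a name character: skips <?xml ...?> and <!DOCTYPE>.
    match = re.search(rb"<[A-Za-z_][^\s/>]*", data)
    if match is None:
        return data
    extra = b""
    for prefix, uri in namespaces.items():
        decl = b"xmlns:" + prefix.encode("ascii")
        # Crude check: skip prefixes that already appear to be declared.
        if decl + b"=" not in data:
            extra += b" " + decl + b'="' + uri.encode("ascii") + b'"'
    return data[:match.end()] + extra + data[match.end():]


raw = b"""<?xml version="1.0" encoding="UTF-8"?>
<item subtype="bla">
    <thing>Word</thing>
    <abc:thing2>Another Word</abc:thing2>
</item>"""

root = ET.fromstring(declare_missing_prefixes(raw, KNOWN_NAMESPACES))
thing2 = root.find("abc:thing2", KNOWN_NAMESPACES)
print(thing2.text)
```

Because only the declaration is added, the parser remains strict about every other well-formedness error, which is the main difference from the recover option.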

Recommended answer

One possible way is to use lxml, an ElementTree-compatible library. For example:

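from lxml import etree as ElementTree

# Bytes literal: lxml rejects str input that carries an encoding declaration.
xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<item subtype="bla">
    <thing>Word</thing>
    <abc:thing2>Another Word</abc:thing2>
</item>"""
# recover=True lets the parser continue past errors such as the unbound prefix.
parser = ElementTree.XMLParser(recover=True)
tree = ElementTree.fromstring(xml, parser)

# Select the first <thing> element with XPath and print it.
thing = tree.xpath("//thing")[0]
print(ElementTree.tostring(thing))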

With lxml, all you need to do to parse XML that is not well-formed is pass recover=True to the XMLParser constructor. lxml also fully supports XPath 1.0, which is very useful when you need to select part of an XML document using more complex criteria.

Update:

I don't know every type of XML error that the recover=True option can tolerate. Besides an unbound namespace prefix, one other type I know of is an unclosed tag. lxml fixes an unclosed tag, rather than ignoring it, by automatically adding the corresponding closing tag. For example, given the following broken XML:

xml = """<item subtype="bla">
    <thing>Word</thing>
    <bad>
    <abc:thing2>Another Word</abc:thing2>
</item>"""
parser = ElementTree.XMLParser(recover=True)
tree = ElementTree.fromstring(xml, parser)

print(ElementTree.tostring(tree))

The final XML output after parsing with lxml is as follows:

<item subtype="bla">
    <thing>Word</thing>
    <bad>
    <abc:thing2>Another Word</abc:thing2>
</bad></item>
