Parsing XML with undeclared prefixes in Python
Problem description
I am trying to parse XML data with Python. The data uses namespace prefixes, but not every file declares them. Example XML:
<?xml version="1.0" encoding="UTF-8"?>
<item subtype="bla">
<thing>Word</thing>
<abc:thing2>Another Word</abc:thing2>
</item>
I have been using xml.etree.ElementTree to parse these files, but whenever a prefix is not properly declared, ElementTree raises a parse error ("unbound prefix", reported right at the start of <abc:thing2>).
Searching for this error leads me to solutions that suggest fixing the namespace declaration. However, I do not control the XML that I need to work with, so modifying the input files is not a viable option.
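The failure described above can be reproduced with the standard library alone. This is a minimal sketch: the `SAMPLE` constant and the `try_parse` helper are names made up for illustration, and the data is the example XML from the question.

```python
import xml.etree.ElementTree as ET

# The example document from the question; the "abc" prefix is never declared.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<item subtype="bla">
<thing>Word</thing>
<abc:thing2>Another Word</abc:thing2>
</item>"""


def try_parse(data):
    """Return (root, None) on success or (None, error) on a parse failure."""
    try:
        return ET.fromstring(data), None
    except ET.ParseError as exc:
        return None, exc


root, error = try_parse(SAMPLE)
print(error)           # the message names the problem as an unbound prefix
print(error.position)  # (line, column) where the parser gave up
```

The `position` attribute of `ParseError` is useful for logging which element triggered the failure when many files are processed in a batch.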
Searching for namespace parsing in general leads me to many questions about searching in a namespace-agnostic way, which is not what I need.
I am looking for some way to parse these files automatically, even if the namespace declaration is broken. I have thought about doing the following:
- tell ElementTree what namespaces to expect beforehand, because I do know which ones can occur. I found register_namespace, but that does not seem to work.
- have the full DTD read in before parsing, and see if that solves it. I could not find a way to do this with ElementTree.
- tell ElementTree not to bother about namespaces at all. This should not cause issues with my data, but I found no way to do it.
- use some other parsing library that can handle this issue, though I would prefer not to install extra libraries. I have difficulty seeing from the documentation whether any others would be able to solve my issue.
- some other route that I am currently not seeing?
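For the "do not bother about namespaces" option, one possible stdlib-only route is to drive `xml.parsers.expat` directly. A parser created without a namespace separator does no namespace processing, so an undeclared prefix is simply part of the tag name. The sketch below feeds its events into ElementTree's `TreeBuilder`; the helper name `parse_without_namespaces` is hypothetical, and the approach assumes the files are otherwise well-formed.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat

SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<item subtype="bla">
<thing>Word</thing>
<abc:thing2>Another Word</abc:thing2>
</item>"""


def parse_without_namespaces(data):
    """Build an ElementTree element tree with namespace processing disabled.

    Hypothetical helper: prefixes are kept verbatim in tag names
    (for example "abc:thing2") instead of being resolved to URIs.
    """
    builder = ET.TreeBuilder()
    parser = expat.ParserCreate()  # no namespace_separator: prefixes are not checked
    parser.buffer_text = True      # deliver each text run in one piece
    parser.StartElementHandler = builder.start
    parser.EndElementHandler = builder.end
    parser.CharacterDataHandler = builder.data
    parser.Parse(data, True)
    return builder.close()


root = parse_without_namespaces(SAMPLE)
for child in root:
    print(child.tag, "->", child.text)
```

Elements parsed this way carry the literal prefixed name in `.tag`. Path expressions passed to `find()` treat a colon as a namespace prefix, so comparing `.tag` directly is the safer way to locate such elements.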
UPDATE:
After Har07 put me on the path of lxml, I tried to see whether it would let me carry out the different solutions I had thought of, and what the result would be:
- Telling the parser what namespaces to expect beforehand: I still could not find any 'official' way to do this, but in my earlier searches I had found the suggestion to simply add the requisite declaration to the data programmatically (for a different programming situation; unfortunately I can't find the link anymore). It seemed terribly hacky to me, but I tried it anyway. It involves loading the data as a string, changing the enclosing element to have the right xmlns declarations, and then handing it off to lxml.etree's fromstring method. Unfortunately, that also requires removing every encoding declaration from the string. It works, though.
- Reading in the DTD before parsing: this is possible with lxml (through attribute_defaults, dtd_validation, or load_dtd), but unfortunately it does not solve the namespace issue.
- Telling lxml not to bother about namespaces: this is possible through the recover option. Unfortunately, that also ignores other ways in which the XML may be broken (see Har07's answer for details).
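The first workaround in the list above (adding the missing declarations programmatically) can be sketched with the standard library only. Everything named here is an assumption for illustration: the `KNOWN_PREFIXES` mapping, its placeholder URI, and the helper `declare_missing_prefixes`. The regular expression is deliberately simple and would be fooled by a comment containing something that looks like a start tag before the root element. Working on bytes rather than str sidesteps the encoding-declaration problem mentioned above.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical mapping: prefixes known to occur, with placeholder URIs.
KNOWN_PREFIXES = {"abc": "urn:example:abc"}

SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<item subtype="bla">
<thing>Word</thing>
<abc:thing2>Another Word</abc:thing2>
</item>"""


def declare_missing_prefixes(data, prefixes=KNOWN_PREFIXES):
    """Insert xmlns:prefix declarations into the root start tag if absent."""
    # First "<name": skips the XML declaration, comments and DOCTYPE.
    match = re.search(rb"<[A-Za-z_][^\s/>]*", data)
    if match is None:
        return data
    close = data.find(b">", match.end())
    start_tag = data[match.start():close if close != -1 else len(data)]
    additions = b"".join(
        b' xmlns:%s="%s"' % (prefix.encode(), uri.encode())
        for prefix, uri in prefixes.items()
        if b"xmlns:" + prefix.encode() + b"=" not in start_tag
    )
    # Splice the declarations in directly after the root element's name.
    return data[:match.end()] + additions + data[match.end():]


patched = declare_missing_prefixes(SAMPLE)
root = ET.fromstring(patched)
thing2 = root.find("{urn:example:abc}thing2")
print(thing2.text)
```

Unlike a recovering parser, this keeps strict parsing for every other kind of error; the cost is that the prefixed elements now live in the placeholder namespace and must be addressed with its URI.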
Recommended answer
One possible way is to use lxml, a library compatible with ElementTree. For example:
from lxml import etree as ElementTree
# Bytes input: lxml rejects str input that carries an encoding declaration.
xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<item subtype="bla">
<thing>Word</thing>
<abc:thing2>Another Word</abc:thing2>
</item>"""
# recover=True lets the parser continue past the undeclared prefix.
parser = ElementTree.XMLParser(recover=True)
tree = ElementTree.fromstring(xml, parser)
thing = tree.xpath("//thing")[0]
print(ElementTree.tostring(thing))
All you need to do to parse non-well-formed XML with lxml is pass recover=True to the XMLParser constructor. lxml also has full support for XPath 1.0, which is very useful when you need to select part of an XML document using more complex criteria.
UPDATE:
I don't know all the types of XML error that the recover=True option can tolerate. But here is another type of error that I know of besides an unbound namespace prefix: an unclosed tag. lxml will fix, rather than ignore, an unclosed tag by automatically adding the corresponding closing tag. For example, given the following broken XML:
xml = """<item subtype="bla">
<thing>Word</thing>
<bad>
<abc:thing2>Another Word</abc:thing2>
</item>"""
parser = ElementTree.XMLParser(recover=True)
tree = ElementTree.fromstring(xml, parser)
print(ElementTree.tostring(tree))
The final XML output after parsing with lxml is as follows:
<item subtype="bla">
<thing>Word</thing>
<bad>
<abc:thing2>Another Word</abc:thing2>
</bad></item>