Python lxml: Ignore XML declaration (errors)
Question
I am trying to parse the file browser Thunar's custom actions file (~/.config/Thunar/uca.xml) with the lxml Python module.
For some reason, Thunar apparently writes a malformed declaration into these files:
<?xml encoding="UTF-8" version="1.0"?>
Evidently, version is expected to appear as the first "attribute" in the declaration. lxml raises an XMLSyntaxError if I try to parse the file.
And no, I cannot simply correct the declaration, because Thunar keeps overwriting it with the bogus one.
This is very likely a bug in Thunar.
Nevertheless, I would like to know how to ignore the XML declaration with lxml.
I know that I could pre-process the XML document to filter out the XML declaration, but this doesn't seem very elegant. Since XML seems to default to version 1.0 and UTF-8 encoding, there surely must be a way to ignore the declaration in lxml and assume those defaults. I didn't find anything in the documentation or on Google, but I might have overlooked something.
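For reference, the pre-processing approach mentioned above might look roughly like the following. This is only a sketch: the helper name strip_xml_declaration and the sample document are made up for illustration, and it uses the standard library's xml.etree.ElementTree so that it runs without extra dependencies. The cleaned bytes could presumably be handed to lxml's etree.fromstring in the same way.

```python
import re
import xml.etree.ElementTree as ET

# Matches an XML declaration (well-formed or not) at the very start of the data.
_XML_DECL = re.compile(rb"\A\s*<\?xml\s.*?\?>", re.DOTALL)


def strip_xml_declaration(data: bytes) -> bytes:
    """Return data with a leading <?xml ...?> declaration removed, if present."""
    return _XML_DECL.sub(b"", data, count=1)


# Hypothetical sample mimicking the malformed declaration from the question.
sample = (
    b'<?xml encoding="UTF-8" version="1.0"?>\n'
    b"<actions><action><name>Example</name></action></actions>"
)

# Without a declaration, the parser falls back to XML 1.0 and UTF-8.
root = ET.fromstring(strip_xml_declaration(sample))
print(root.tag)
```

Documents without a declaration pass through unchanged, so the helper can be applied unconditionally.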
Recommended answer
I know very little about Thunar, but if it produces the XML declaration shown in the question, then that is a bug. An incorrect XML declaration makes the document ill-formed.
The XML grammar specifies exactly one correct order for the items in the XML declaration: version must come first and encoding second. See http://w3.org/TR/xml/#NT-XMLDecl.
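To illustrate the ordering rule, here is a rough regular-expression transcription of the XMLDecl production from the XML 1.0 specification. The function name is_well_formed_decl is invented for this sketch, and the pattern is a simplified check meant for experimenting, not a replacement for a real parser.

```python
import re

S = r"[ \t\r\n]+"               # mandatory whitespace
EQ = r"[ \t\r\n]*=[ \t\r\n]*"   # '=' with optional surrounding whitespace

# XMLDecl ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
XML_DECL = re.compile(
    r"<\?xml"
    + S + r"version" + EQ + r"""(?:'1\.[0-9]+'|"1\.[0-9]+")"""
    + r"(?:" + S + r"encoding" + EQ
    + r"""(?:'[A-Za-z][A-Za-z0-9._-]*'|"[A-Za-z][A-Za-z0-9._-]*")""" + r")?"
    + r"(?:" + S + r"standalone" + EQ
    + r"""(?:'(?:yes|no)'|"(?:yes|no)")""" + r")?"
    + r"[ \t\r\n]*\?>"
)


def is_well_formed_decl(decl: str) -> bool:
    """Check a declaration string against the XMLDecl ordering."""
    return XML_DECL.fullmatch(decl) is not None


print(is_well_formed_decl('<?xml version="1.0" encoding="UTF-8"?>'))
print(is_well_formed_decl('<?xml encoding="UTF-8" version="1.0"?>'))
```

The first call uses the order required by the grammar; the second uses the declaration from the question, which the pattern rejects because version does not come first.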
However, with lxml you can parse using a parser instance that has the recover option set to True. It works in this case: the bad XML declaration is ignored.
from lxml import etree

# recover=True tells the parser to try to continue past errors such as the bad declaration
parser = etree.XMLParser(recover=True)
tree = etree.parse('uca.xml', parser)
See http://lxml.de/api/lxml.etree.XMLParser-class.html