Python lxml: Ignore XML declaration (errors)

Question

I am trying to parse the custom actions file of the file browser Thunar (~/.config/Thunar/uca.xml) with the lxml Python module.

For some reason, Thunar writes a malformed declaration into this file:

<?xml encoding="UTF-8" version="1.0"?>

Apparently, version is expected to appear as the first "attribute" in the declaration. lxml raises an XMLSyntaxError if I try to parse the file.

And no, I cannot simply correct the declaration, because Thunar keeps overwriting it with the bogus one.

This is very likely a bug in Thunar.

Nevertheless, I would like to know how to ignore the XML declaration with lxml.

I know that I could pre-process the XML document to filter out the XML declaration, but this doesn't seem very elegant. Since XML seems to default to version 1.0 and UTF-8 encoding, surely there is a way to ignore the declaration and have lxml assume those defaults. I didn't find anything in the documentation or on Google, but I might have overlooked something.

Recommended answer

I know very little about Thunar, but if it produces the XML declaration shown in the question, then that is a bug. An incorrect XML declaration means the document is not well-formed.

The XML grammar specifies one correct order for the items in the XML declaration: version must come first and encoding second. See http://w3.org/TR/xml/#NT-XMLDecl.

However, with lxml you can parse using a parser instance that has the recover option set to True. It works in this case: the bad XML declaration is ignored.

from lxml import etree

# recover=True makes the parser try to continue past syntax errors
parser = etree.XMLParser(recover=True)
tree = etree.parse('uca.xml', parser)

See http://lxml.de/api/lxml.etree.XMLParser-class.html
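
Note that recover=True is not limited to the declaration: the parser will also try to carry on past other syntax errors. If that is too permissive, the pre-processing route mentioned in the question might look roughly like the sketch below. The helper name strip_xml_declaration and the sample document are made up for illustration, and the standard library's xml.etree.ElementTree is used only to keep the example self-contained; lxml's etree.fromstring accepts the same bytes.

```python
import re
import xml.etree.ElementTree as ET

# Matches an XML declaration at the start of the data. Requiring whitespace
# after "<?xml" avoids matching other instructions such as <?xml-stylesheet ...?>.
_XML_DECL = re.compile(rb'^\s*<\?xml\s[^>]*\?>')


def strip_xml_declaration(data: bytes) -> bytes:
    """Return data without its leading XML declaration, if it has one."""
    return _XML_DECL.sub(b'', data, count=1)


# Hypothetical sample that mimics the declaration Thunar writes.
sample = (
    b'<?xml encoding="UTF-8" version="1.0"?>\n'
    b'<actions><action><name>Open Terminal</name></action></actions>'
)

# Without a declaration, the parser falls back to the XML defaults.
root = ET.fromstring(strip_xml_declaration(sample))
names = [name.text for name in root.iter('name')]
```

This only removes the declaration and leaves the rest of the document untouched, so any other error in the file would still be reported by the parser.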
