lxml Unicode entity parse problems
Question
I'm using lxml as follows to parse an exported XML file from another system:
from lxml import etree
xmldoc = open(filename)
etree.parse(xmldoc)
But I get:
lxml.etree.XMLSyntaxError: Entity 'eacute' not defined, line 4495, column 46
Obviously it's having problems with Unicode entity names, but how would I get round this? Via open() or parse()?
Edit: I had forgotten to include my DTD in the same folder. It's there now and has the following declaration:
<!ENTITY eacute "é">
and is referred to (and always was) in xmldoc as so:
<?xml version="1.0" encoding="ISO-8859-1" ?>
<!DOCTYPE DScribeDatabase SYSTEM "foo.dtd">
Yet I still get the same problem ... does the DTD need to be declared in Python too?
Recommended answer
eacute is not a predefined entity in XML. To include an &eacute; entity reference in an XML file, the file must have a <!DOCTYPE> declaration pointing to a DTD (such as an XHTML 1.0 DTD) that defines the entity.
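Even with such a declaration in place, the parser has to actually read the DTD before it can expand the entity. A minimal sketch of one way to do that with lxml, assuming its XMLParser options load_dtd and resolve_entities; the file name export.xml, the one-element document, and the helper parse_with_dtd are made up for illustration, while foo.dtd and DScribeDatabase come from the question:

```python
import os
import tempfile

from lxml import etree

# Hypothetical stand-ins for the exported file and its DTD.
DTD_TEXT = (
    '<!ELEMENT DScribeDatabase (#PCDATA)>\n'
    '<!ENTITY eacute "&#233;">\n'
)
XML_TEXT = (
    '<?xml version="1.0" encoding="ISO-8859-1" ?>\n'
    '<!DOCTYPE DScribeDatabase SYSTEM "foo.dtd">\n'
    '<DScribeDatabase>caf&eacute;</DScribeDatabase>\n'
)


def parse_with_dtd(path):
    """Parse an XML file, loading the DTD named in its DOCTYPE."""
    # load_dtd makes the parser read the external DTD;
    # resolve_entities asks it to expand entity references.
    parser = etree.XMLParser(load_dtd=True, resolve_entities=True)
    return etree.parse(path, parser)


with tempfile.TemporaryDirectory() as folder:
    # The DTD sits next to the XML file, as described in the question.
    with open(os.path.join(folder, "foo.dtd"), "w", encoding="ascii") as f:
        f.write(DTD_TEXT)
    xml_path = os.path.join(folder, "export.xml")
    with open(xml_path, "w", encoding="ascii") as f:
        f.write(XML_TEXT)
    root_text = parse_with_dtd(xml_path).getroot().text

print(root_text)
```

Passing the file path rather than an already opened file object is a deliberate choice in this sketch: it gives the parser a base location against which the relative name foo.dtd can be resolved.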
If the XML uses &eacute; but doesn't have a <!DOCTYPE>, it is not well-formed and the system that exported it needs to be fixed.
(There isn't a good reason to use an entity reference to represent é in an XML file. The character reference &#233; is understood everywhere without entity definitions, if the file can't simply include a raw UTF-8 é for some reason.)
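If the exporting system cannot be fixed, one possible workaround is to rewrite named entities as numeric character references before parsing. The following is only a sketch under stated assumptions: it assumes the names used in the file are the standard HTML ones listed in Python's html.entities.name2codepoint, and it will also rewrite matches inside CDATA sections. The function named_to_numeric and the sample string are hypothetical.

```python
import re
from html.entities import name2codepoint

# The five entities XML itself predefines; these are left untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def named_to_numeric(text):
    """Replace HTML-style named entities with numeric character references."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            # Leave predefined and unknown entities exactly as they are.
            return match.group(0)
        return "&#%d;" % name2codepoint[name]

    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, text)


sample = "<name>Ren&eacute;e &amp; Zo&euml;</name>"
converted = named_to_numeric(sample)
print(converted)
```

The converted text can then be handed to any XML parser without a DTD, since numeric character references need no entity definitions.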