Is there a way to recover iterparse on invalid Char values?
Problem description
I'm using lxml's iterparse to parse some big XML files (3-5 GB). Some of these files contain invalid characters, so an lxml.etree.XMLSyntaxError is thrown.
When using lxml.etree.parse, I can provide a parser that recovers from invalid characters:
import lxml.etree
parser = lxml.etree.XMLParser(recover=True)
root = lxml.etree.parse("myMalformed.xml", parser)
Is there a way to get the same functionality for iterparse?
Edit: Encoding is not an issue here. These XML files contain invalid characters, which can be sanitized by defining an XMLParser with recover=True. Since I need to use iterparse for this, I can't use a custom parser. So I'm looking for the functionality provided by my snippet above, but for this call:
context = etree.iterparse("myMalformed.xml", events=('end',), tag="Foo")  # <-- can't recover
Recommended answer
When you say invalid characters, do you mean Unicode characters? If so, you can try
lxml.etree.XMLParser(encoding='UTF-8', recover=True)
If you mean malformed XML, then this obviously won't work. If you can post your traceback, we can see the nature of the XMLSyntaxError, which will provide more information.
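If the invalid characters turn out to be control bytes that XML 1.0 forbids, one possible workaround is to strip them from the byte stream before the incremental parser sees them. The following is only a sketch under several assumptions: the files are UTF-8 (or another ASCII-compatible encoding), the offending characters are below 0x20, and the class name SanitizingReader and the helper foo_texts are made up for illustration. The demo uses the standard library's xml.etree.ElementTree.iterparse as a stand-in so that it runs without extra dependencies; lxml.etree.iterparse also accepts a file-like object with a read() method, so the same wrapper should be usable there.

```python
import io
import re
import xml.etree.ElementTree as ET  # stdlib stand-in so the sketch runs without lxml

# Control bytes that XML 1.0 forbids: everything below 0x20 except tab, LF, CR.
# Assumption: the input is UTF-8, where these byte values never occur inside
# a multi-byte sequence, so deleting them byte-wise cannot split a character.
_INVALID_XML_BYTES = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")


class SanitizingReader:
    """Hypothetical file-like wrapper that drops forbidden control bytes."""

    def __init__(self, raw):
        self._raw = raw  # any binary file object

    def read(self, size=-1):
        while True:
            chunk = self._raw.read(size)
            if not chunk:
                return b""  # genuine end of input
            cleaned = _INVALID_XML_BYTES.sub(b"", chunk)
            if cleaned:
                return cleaned
            # The chunk held only invalid bytes: read again so that an
            # empty result is never mistaken for end of file.


def foo_texts(stream):
    """Collect the text of every <Foo> element from a binary stream."""
    texts = []
    for _event, elem in ET.iterparse(SanitizingReader(stream), events=("end",)):
        if elem.tag == "Foo":
            texts.append(elem.text)
            elem.clear()  # release the element's content once it is processed
    return texts


# The first <Foo> contains a vertical tab (0x0B), which is not a legal XML character.
sample = b"<root><Foo>ab\x0bc</Foo><Foo>ok</Foo></root>"
result = foo_texts(io.BytesIO(sample))
print(result)
```

For a real file, the stream would be something like open("myMalformed.xml", "rb"). This filter does not cover other illegal code points such as U+FFFE or unpaired surrogates, and it would corrupt UTF-16 input. Newer lxml releases also document a recover keyword on iterparse itself; whether it skips invalid characters the same way as XMLParser(recover=True) is worth checking against the installed version.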