Is there a way to recover iterparse on invalid Char values?


Problem description

I'm using lxml's iterparse to parse some big XML files (3-5 GB). Some of these files contain invalid characters, so an lxml.etree.XMLSyntaxError is thrown.

When using lxml.etree.parse, I can provide a parser that recovers on invalid characters:

import lxml.etree

parser = lxml.etree.XMLParser(recover=True)
root = lxml.etree.parse("myMalformed.xml", parser).getroot()

Is there a way to get the same functionality for iterparse?

Edit: Encoding is not an issue here. These XML files contain invalid characters, which can be sanitized by defining an XMLParser with recover=True. Since I need to use iterparse for this, I can't pass in a custom parser. So I'm looking for the functionality provided by the snippet above, but for this call:

from lxml import etree
context = etree.iterparse("myMalformed.xml", events=('end',), tag="Foo")  # <-- can't recover

Recommended answer

When you say invalid characters, do you mean Unicode characters? If so, you can try

lxml.etree.XMLParser(encoding='UTF-8', recover=True)

If you mean malformed XML, then this obviously won't work. If you can post your traceback, we can see the nature of the XMLSyntaxError, which will provide more information.
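If the errors really are "invalid Char value" errors caused by stray control characters, one possible workaround is to sanitize the byte stream before the incremental parser sees it. The sketch below is not from the original answer. SanitizingReader and iter_foo_text are hypothetical helper names, and the code assumes the files are UTF-8 or another ASCII-compatible encoding, so that control bytes never occur inside a multi-byte character. It only drops C0 control bytes and does nothing for other kinds of malformed XML. The standard library's xml.etree.ElementTree.iterparse is used here so the example runs without third-party packages. lxml.etree.iterparse also accepts file-like objects, so the same wrapper should be usable there. Newer lxml releases additionally document a recover keyword on iterparse itself, which is worth checking against your installed version.

```python
import io
import re
import xml.etree.ElementTree as ET  # stdlib stand-in so the sketch runs without lxml

# C0 control bytes that XML 1.0 forbids: everything below 0x20 except tab, LF, CR.
# Assumption: the input is UTF-8 or another ASCII-compatible encoding.
_INVALID_XML_BYTES = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")


class SanitizingReader:
    """Hypothetical file-like wrapper that drops invalid control bytes chunk by chunk."""

    def __init__(self, raw):
        self._raw = raw  # any binary file object, e.g. open(path, "rb")

    def read(self, size=-1):
        while True:
            chunk = self._raw.read(size)
            if not chunk:
                return chunk  # real end of file
            cleaned = _INVALID_XML_BYTES.sub(b"", chunk)
            if cleaned:
                return cleaned
            # The whole chunk was invalid bytes: read on instead of
            # returning b"", which the parser would treat as end of file.


def iter_foo_text(raw, tag="Foo"):
    """Yield the text of every <tag> element, freeing each element afterwards."""
    for _event, elem in ET.iterparse(SanitizingReader(raw), events=("end",)):
        if elem.tag == tag:
            yield elem.text
            elem.clear()  # keep memory bounded on multi-gigabyte files


# Small in-memory demo; a real file would be passed as open("myMalformed.xml", "rb").
sample = b"<root><Foo>a\x08b</Foo><Foo>c\x0bd</Foo></root>"
print(list(iter_foo_text(io.BytesIO(sample))))
```

Because the wrapper filters each chunk as it is read, the file is never loaded into memory as a whole, which matters for inputs in the 3-5 GB range.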
