lxml encoding error when parsing UTF-8 XML


Question

I'm trying to iterate through an XML file (UTF-8 encoded, starts with ) with lxml, but I get the following error on the character 丂:

UnicodeEncodeError: 'cp932' codec can't encode character u'\u4e02' in position 0: illegal multibyte sequence

Other characters before this one are printed correctly. The code is:

from lxml import etree

parser = etree.XMLParser(encoding='utf-8')
tree = etree.parse("filename.xml", parser)
root = tree.getroot()
for elem in root:
    # Print the text of each element's first child
    print(elem[0].text)

Does the error mean that the file was parsed as Shift JIS instead of UTF-8?

Recommended answer

The stack trace of the UnicodeEncodeError points to the location where the exception occurs. Unfortunately you didn't include it, but it is most likely the last line, where the unicode text is printed to stdout. I assume that stdout uses the cp932 encoding on your system. Note that this is an encode error, raised when writing output, so it does not indicate that the file was parsed as Shift JIS.
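To check the assumption about stdout, a small sketch like the following may help. It is written in Python 3 syntax, and `can_encode` is a hypothetical helper name, not part of lxml or the standard library. It reports which encoding stdout uses and whether that encoding can represent the character from the error message.

```python
import sys


def can_encode(text, encoding):
    """Return True if `text` can be encoded with `encoding` (hypothetical helper)."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    # sys.stdout.encoding can be None in some setups; fall back to ASCII then.
    encoding = sys.stdout.encoding or "ascii"
    # Only ASCII is printed here, so this line itself cannot hit the same error.
    print(encoding, can_encode(u"\u4e02", encoding))
```

If this prints the console's encoding followed by `False`, the failure comes from writing to the console rather than from lxml.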

If my assumptions are correct, you should consider changing your environment so that stdout uses an encoding that can represent all Unicode characters, such as UTF-8 (see, for example, "Writing unicode strings via sys.stdout in Python").
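One way to do this from inside the script is sketched below in Python 3 syntax: wrap the byte stream underneath stdout in a text writer that always encodes as UTF-8. `utf8_writer` is a hypothetical helper name. This is only a sketch under the assumption that the console can display UTF-8 bytes; whether the characters render correctly still depends on the terminal's code page and font.

```python
import io
import sys


def utf8_writer(binary_stream):
    """Wrap a binary stream in a text writer that always encodes as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8", line_buffering=True)


if __name__ == "__main__":
    # In Python 3, sys.stdout.buffer is the byte stream underneath stdout.
    out = utf8_writer(sys.stdout.buffer)
    out.write(u"\u4e02\n")  # encoded as UTF-8 regardless of the console default
    out.flush()
```

An alternative that needs no code change is to set the `PYTHONIOENCODING` environment variable to `utf-8` before starting Python.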
