lxml encoding error when parsing UTF-8 XML
Question
I'm trying to iterate through an XML file (UTF-8 encoded, starts with ) with lxml, but I get the following error on the character 丂:
UnicodeEncodeError: 'cp932' codec can't encode character u'\u4e02' in position 0: illegal multibyte sequence
Other characters before this one are printed correctly. The code is:
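from lxml import etree

# Parse the file, explicitly telling lxml that it is UTF-8 encoded.
parser = etree.XMLParser(encoding='utf-8')
tree = etree.parse("filename.xml", parser)
root = tree.getroot()
for elem in root:
    # Print the text of each element's first child (Python 2 print statement).
    print elem[0].text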
Does the error mean that the file was parsed as Shift JIS rather than as UTF-8?
Recommended answer
The stack trace of the UnicodeEncodeError points to the location where the exception occurs. Unfortunately you didn't include it, but it is most likely the last line, where the unicode text is printed to stdout. I assume that stdout uses the cp932 encoding on your system. Note that this is an encode error: it is raised when unicode text is converted to bytes for output, not when the file is decoded during parsing, so the file was most likely parsed as UTF-8 correctly.
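To illustrate the difference, here is a minimal sketch (Python 3 syntax, standard library only) that checks whether the character from the question can be encoded by each codec. The helper name `can_encode` is made up for this illustration; it is not part of lxml or of the original code.

```python
# The character from the question, U+4E02.
text = u"\u4e02"


def can_encode(s, encoding):
    """Return True if the string s can be encoded with the given codec."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# UTF-8 can represent every Unicode code point; cp932 covers only a subset.
utf8_ok = can_encode(text, "utf-8")
cp932_ok = can_encode(text, "cp932")

# ascii() avoids triggering the very same error when printing the report.
print("utf-8:", utf8_ok, "cp932:", cp932_ok, "char:", ascii(text))
```

If `cp932_ok` comes out false while `utf8_ok` is true, that matches the error in the question: the text itself is fine, and only the conversion to the console encoding fails.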
If my assumptions are correct, you should consider changing your environment so that stdout uses an encoding that can represent all Unicode characters, such as UTF-8 (see, for example, "Writing unicode strings via sys.stdout in Python").
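One possible way to do this from inside the program is sketched below, assuming Python 3; it is not necessarily the approach the linked question describes. A binary stream is wrapped in an `io.TextIOWrapper` that encodes everything as UTF-8. The function name `utf8_writer` is hypothetical, and the demonstration writes to an in-memory buffer rather than the real console so that it behaves the same on any system.

```python
import io


def utf8_writer(binary_stream):
    """Wrap a binary stream so that text written to it is encoded as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8")


# Demonstration against an in-memory buffer instead of the real stdout.
buf = io.BytesIO()
out = utf8_writer(buf)
out.write(u"\u4e02")  # the character that cp932 could not encode
out.flush()
encoded = buf.getvalue()

# For real output, the same wrapper could be applied to stdout (Python 3):
#   import sys
#   sys.stdout = utf8_writer(sys.stdout.buffer)
```

Alternatively, setting the `PYTHONIOENCODING` environment variable to `utf-8` before starting Python changes the encoding of stdout without touching the code. In either case the terminal itself still has to be able to display the resulting bytes.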