How to correctly parse utf-8 xml with ElementTree?

Question

I need help understanding why parsing my xml file* with xml.etree.ElementTree produces the following errors.

* My test xml file contains Arabic characters.

Task:
Open and parse the utf8_file.xml file.

My first try:

import codecs
import xml.etree.ElementTree as etree
with codecs.open('utf8_file.xml', 'r', encoding='utf-8') as utf8_file:
    xml_tree = etree.parse(utf8_file)

Result 1:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 236-238: ordinal not in range(128)

My second try:

import codecs
import xml.etree.ElementTree as etree
with codecs.open('utf8_file.xml', 'r', encoding='utf-8') as utf8_file:
    xml_string = etree.tostring(utf8_file, encoding='utf-8', method='xml')
    xml_tree = etree.fromstring(xml_string)

Result 2:

AttributeError: 'file' object has no attribute 'getiterator'

Please explain the errors above and comment on a possible solution.

Recommended answer

Leave decoding the bytes to the parser; do not decode first:

import xml.etree.ElementTree as etree
# Open in binary mode so the parser receives raw bytes and decodes them itself.
with open('utf8_file.xml', 'rb') as xml_file:
    xml_tree = etree.parse(xml_file)
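
As a self-contained check of this approach, here is a minimal sketch, assuming Python 3. The sample document, the ARABIC_TEXT constant and the parse_sample() helper are made up for illustration, since the original utf8_file.xml is not shown. It writes a small UTF-8 file containing Arabic text and parses it twice: once from a binary-mode file handle, and once by passing the path straight to etree.parse(), which then opens the file itself.

```python
import os
import tempfile
import xml.etree.ElementTree as etree

# Hypothetical sample content; the real utf8_file.xml is not part of the question.
ARABIC_TEXT = 'مرحبا'
SAMPLE = ('<?xml version="1.0" encoding="utf-8"?>\n'
          '<greeting>' + ARABIC_TEXT + '</greeting>\n')


def parse_sample():
    """Write SAMPLE as UTF-8 bytes, then parse it from a handle and from a path."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, 'utf8_file.xml')
        with open(path, 'wb') as out_file:
            out_file.write(SAMPLE.encode('utf-8'))

        # Binary mode: the parser gets bytes and does the decoding.
        with open(path, 'rb') as xml_file:
            from_handle = etree.parse(xml_file).getroot().text

        # Passing the path lets ElementTree open the file on its own.
        from_path = etree.parse(path).getroot().text
    return from_handle, from_path


if __name__ == '__main__':
    print(parse_sample())
```

Both calls should return the same decoded text, without any explicit decoding step in the calling code.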

An XML file must carry enough information in its first line (the XML declaration) for the parser to handle decoding. If the declaration is missing, the parser must assume UTF-8 is used.
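
To illustrate how the declaration drives decoding, here is a small sketch, assuming Python 3. The element name and sample strings are invented for the example. It parses UTF-8 bytes that have no declaration, ISO-8859-1 bytes that declare their encoding, and the same ISO-8859-1 bytes with the declaration removed.

```python
import xml.etree.ElementTree as etree

# No declaration: the parser treats the bytes as UTF-8.
utf8_bytes = '<word>مرحبا</word>'.encode('utf-8')
utf8_text = etree.fromstring(utf8_bytes).text

# A declaration naming another encoding tells the parser how to decode.
latin1_bytes = ('<?xml version="1.0" encoding="iso-8859-1"?>'
                '<word>café</word>').encode('iso-8859-1')
latin1_text = etree.fromstring(latin1_bytes).text

# The same Latin-1 bytes without a declaration are not valid UTF-8,
# so the parser rejects them instead of guessing.
undeclared_bytes = '<word>café</word>'.encode('iso-8859-1')
try:
    etree.fromstring(undeclared_bytes)
    undeclared_parsed = True
except etree.ParseError:
    undeclared_parsed = False
```

In every case the caller hands over bytes only. Which codec applies is decided by the parser from the document itself.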

Because it is the XML header that holds this information, it is the responsibility of the parser to do all decoding.

Your first attempt failed because codecs.open() had already decoded the file to Unicode, so Python implicitly tried to encode the Unicode values again with the default ASCII codec to give the parser the byte strings it expected, and ASCII cannot represent the Arabic characters. The second attempt failed because etree.tostring() expects a parsed tree (an Element) as its first argument, not a file object, which is why the AttributeError mentions 'file'.
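
For comparison, here is a sketch of what the second attempt was presumably aiming for, assuming Python 3 and an invented sample document: etree.tostring() receives an already parsed Element and returns bytes, and those bytes can go straight back into etree.fromstring().

```python
import xml.etree.ElementTree as etree

# Hypothetical input standing in for the raw contents of utf8_file.xml.
source = ('<?xml version="1.0" encoding="utf-8"?>'
          '<word>مرحبا</word>').encode('utf-8')

# Bytes in, Element out: the parser does the decoding.
root = etree.fromstring(source)

# Element in, bytes out: tostring() serializes a parsed tree.
xml_bytes = etree.tostring(root, encoding='utf-8', method='xml')

# The serialized bytes can be parsed again without manual decoding.
root_again = etree.fromstring(xml_bytes)
```

The round trip never produces a decoded string on the caller's side. Text is only decoded inside the parsed tree, for example in root.text.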
