Parse XML in Python with encoding other than utf-8

Views: 105 | Published: 2021/4/21 20:25:52 | Tags: python xml character-encoding


Question

Any clue on how to parse XML in Python that has encoding='Windows-1255' in it? At least the lxml.etree parser won't even look at the string when the XML header contains an "encoding" declaration that isn't "utf-8" or "ASCII".

Running the following code fails with:

ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

from lxml import etree

parser = etree.XMLParser(encoding='utf-8')

def convert_xml_to_utf8(xml_str):
    tree = etree.fromstring(xml_str, parser=parser)
    if tree.docinfo.encoding == 'utf-8':
        # already in correct encoding, abort
        return xml_str
    decoded_str = xml_str.decode(tree.docinfo.encoding)
    utf8_encoded_str = decoded_str.encode('utf-8')
    tree = etree.fromstring(utf8_encoded_str)
    tree.docinfo.encoding = 'utf-8'
    return etree.tostring(tree, pretty_print = True, xml_declaration = True, encoding='UTF-8', standalone="yes")


data = '''<?xml version='1.0' encoding='Windows-1255'?><rss version="2.0"><channel ><title ><![CDATA[ynet - חדשות]]></title></channel></rss>'''
print(convert_xml_to_utf8(data))

Recommended answer

data is a Unicode str. The error says that a str which also contains an encoding="..." declaration is not supported: a str has supposedly already been decoded from its encoding, so it is ambiguous, if not nonsensical, for it to carry an encoding declaration as well. The message tells you to use bytes instead, e.g. data = b'<...>'. Presumably you should open the file in binary mode, read the data from there, and let etree handle the encoding="..." declaration, instead of using string literals in your code, which complicates the encoding situation even further.
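To make the str/bytes distinction concrete, here is a minimal sketch. It swaps in the standard library's xml.etree.ElementTree for lxml so that it runs without extra dependencies, and it encodes the sample text to bytes by hand to stand in for data read from a file or a network response. The variable names are illustrative only.

```python
from xml.etree import ElementTree

# The sample document as text. In real code this would not be a string
# literal; it is only here to produce realistic bytes for the demo.
xml_text = (
    "<?xml version='1.0' encoding='Windows-1255'?>"
    '<rss version="2.0"><channel>'
    "<title><![CDATA[ynet - חדשות]]></title>"
    "</channel></rss>"
)

# "cp1255" is Python's codec name for Windows-1255. The result is the raw
# byte sequence a Windows-1255 file would actually contain.
xml_bytes = xml_text.encode("cp1255")

# Given bytes, the parser reads the encoding declaration and decodes the
# content itself.
root = ElementTree.fromstring(xml_bytes)
title = root.findtext("channel/title")
print(title)
```

The same xml_bytes value can be passed to lxml.etree.fromstring, which accepts bytes input that carries an encoding declaration.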

It's as simple as this:

from xml.etree import ElementTree

#        open in binary mode ↓
with open('/tmp/test.xml', 'rb') as f:
    e = ElementTree.fromstring(f.read())

Et voilà, e contains your parsed file, with the encoding having been (presumably) correctly interpreted by etree based on the file's internal encoding="..." declaration.
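For a version that can be run end to end, the sketch below first writes a small Windows-1255 file to a temporary location and then reads it back in binary mode. The sample content and the temporary path are assumptions made for the demonstration; the original uses /tmp/test.xml.

```python
import os
import tempfile
from xml.etree import ElementTree

# Hypothetical sample document, encoded the way it would be stored on disk.
sample = (
    "<?xml version='1.0' encoding='Windows-1255'?>"
    "<rss version='2.0'><channel><title>ynet - חדשות</title></channel></rss>"
).encode("cp1255")

# Write the bytes to a temporary file so there is something to open.
with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    # Binary mode: read() returns bytes, leaving decoding to the parser.
    with open(path, "rb") as f:
        raw = f.read()
    root = ElementTree.fromstring(raw)
    title_text = root.findtext("channel/title")
finally:
    os.remove(path)

print(type(raw).__name__, title_text)
```

Opening the file in text mode instead would hand the parser a str that Python decoded with whatever default encoding applies, which is exactly the situation the lxml error message warns about.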

ElementTree in fact has a shortcut method for this:

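# parse() opens the file itself and returns an ElementTree object;
# call e.getroot() to get the root element.
e = ElementTree.parse('/tmp/test.xml')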
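Since the question's goal was to convert the document to UTF-8, here is a rough sketch of how that could look on top of ElementTree.parse. The function name convert_file_to_utf8 and the sample file are hypothetical, and the standard library is used in place of lxml. Note that CDATA sections are not preserved; their content comes back as ordinary escaped text.

```python
import io
import os
import tempfile
from xml.etree import ElementTree


def convert_file_to_utf8(path):
    """Parse the XML file at `path` and return it re-serialized as UTF-8 bytes."""
    # parse() reads the file as bytes and honors its encoding declaration.
    tree = ElementTree.parse(path)
    buf = io.BytesIO()
    # Serialize with a fresh declaration that names the new encoding.
    tree.write(buf, encoding="utf-8", xml_declaration=True)
    return buf.getvalue()


# Demo with a hypothetical Windows-1255 file.
sample = (
    "<?xml version='1.0' encoding='Windows-1255'?>"
    "<rss version='2.0'><channel><title>ynet - חדשות</title></channel></rss>"
).encode("cp1255")

with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as tmp:
    tmp.write(sample)
    demo_path = tmp.name

try:
    utf8_xml = convert_file_to_utf8(demo_path)
finally:
    os.remove(demo_path)

print(utf8_xml.decode("utf-8"))
```

Working from a path (or from bytes) rather than from a decoded str keeps a single source of truth for the encoding: the declaration inside the document.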
