Reading XML header encoding


Problem description

I have a number of XML files I'd like to process with a script, converting them from whatever encoding they're in to UTF-8.

Using the code given in this great answer I can do the conversion, but how can I read the encoding given in the XML header?

For example, I have many files which are already in UTF-8, which should be left alone:

<?xml version="1.0" encoding="utf-8"?>

However, I have a lot of files which do need to be converted:

<?xml version="1.0" encoding="windows-1255"?>

How can I detect the XML encoding specified in the headers of these files in Python? Better yet, after I detect and re-encode the files, how can I change the XML header to read "utf-8" so that they are skipped in future runs?

Solution

Use lxml to do the parsing; you can then access the original encoding with:

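from lxml import etree

def load_unless_utf8(filename):
    # Pass the path (or a file opened in binary mode) so lxml sees the raw
    # bytes and decodes them using the encoding named in the XML declaration.
    tree = etree.parse(filename)
    # Encoding names are case-insensitive; guard against a missing value.
    if (tree.docinfo.encoding or '').lower() == 'utf-8':
        # already in correct encoding, abort
        return None
    return tree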

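You can then use lxml to write the file out again in UTF-8, for example with tree.write(filename, encoding='utf-8', xml_declaration=True), which also writes a header that declares UTF-8.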
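
Where lxml is not available, the same detect-and-rewrite idea can be sketched with the standard library alone. This is not the lxml approach from the answer but a regex-based alternative, and it rests on some assumptions: the XML declaration starts at the first byte of the file, the file uses an ASCII-compatible encoding (so no UTF-16 input), and Python ships a codec for the declared name. The helper names declared_encoding, to_utf8 and convert_file are hypothetical and chosen only for illustration.

```python
import re
from pathlib import Path

# Captures the encoding pseudo-attribute of a leading XML declaration.
# Assumption: the declaration is ASCII-readable and begins at byte 0.
_DECL_ENCODING = re.compile(
    rb'^(<\?xml[^>]*?encoding\s*=\s*["\'])([A-Za-z][A-Za-z0-9._-]*)(["\'])'
)


def declared_encoding(data: bytes, default: str = "utf-8") -> str:
    """Return the lower-cased encoding named in the header, or the default."""
    match = _DECL_ENCODING.match(data)
    return match.group(2).decode("ascii").lower() if match else default


def to_utf8(data: bytes) -> bytes:
    """Re-encode the document as UTF-8 and update its header to match."""
    encoding = declared_encoding(data)
    if encoding == "utf-8":
        # Already in the target encoding: leave the bytes untouched.
        return data
    utf8 = data.decode(encoding).encode("utf-8")
    # Replace only the encoding name, keeping the original quotes.
    return _DECL_ENCODING.sub(rb"\g<1>utf-8\g<3>", utf8, count=1)


def convert_file(path: str) -> bool:
    """Rewrite the file at path as UTF-8; return True if it was changed."""
    original = Path(path).read_bytes()
    converted = to_utf8(original)
    if converted == original:
        return False
    Path(path).write_bytes(converted)
    return True
```

Calling convert_file on each XML path would leave files that already declare utf-8 alone and rewrite the others in place. A real parser such as lxml remains the more robust choice, since it also handles byte order marks and UTF-16 input, which this sketch does not.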
