Reading XML header encoding
Problem description
I have a number of XML files I'd like to process with a script, converting them from whatever encoding that they're in to UTF-8.
Using the code given in this great answer I can do the conversion, but how can I read the encoding given in the XML header?
For example, I have many files which are already in UTF-8, which should be left alone:
<?xml version="1.0" encoding="utf-8"?>
However, I have a lot of files which do need to be converted:
<?xml version="1.0" encoding="windows-1255"?>
How can I detect the XML encoding specified in the headers of these files in Python? Better, after I detect and reencode the files, how then can I change this XML header to read "utf-8" to avoid processing it in the future?
Use lxml to do the parsing; you can then access the original encoding with:
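from lxml import etree

def needs_conversion(filename):
    # open in binary mode so lxml decodes the bytes using the declared encoding
    with open(filename, 'rb') as xmlfile:
        tree = etree.parse(xmlfile)
    # a missing declaration is treated as UTF-8, the XML default
    if (tree.docinfo.encoding or 'utf-8').lower() == 'utf-8':
        # already in the correct encoding, nothing to do
        return False
    return True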
You can then use lxml to write the file out again in UTF-8.
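As a rough sketch of that last step, the snippet below re-saves a file in place with lxml's `tree.write`, asking for a new XML declaration so the header reads UTF-8 afterwards. The function name `convert_to_utf8` and the sample file are made up for illustration, and the code assumes a document with no declaration can be treated as UTF-8.

```python
import os
import tempfile

from lxml import etree


def convert_to_utf8(filename):
    """Re-save an XML file as UTF-8; return True if it was rewritten."""
    # Passing the path lets lxml read raw bytes and honour the declared encoding.
    tree = etree.parse(filename)
    # Assumption: no declared encoding means the file is UTF-8 (the XML default).
    declared = (tree.docinfo.encoding or "utf-8").lower()
    if declared == "utf-8":
        return False
    # xml_declaration=True makes lxml emit a fresh header naming the new encoding.
    tree.write(filename, encoding="utf-8", xml_declaration=True)
    return True


if __name__ == "__main__":
    # Build a small windows-1255 sample file to exercise the function.
    sample = '<?xml version="1.0" encoding="windows-1255"?><root>\u05e9\u05dc\u05d5\u05dd</root>'
    path = os.path.join(tempfile.mkdtemp(), "sample.xml")
    with open(path, "wb") as f:
        f.write(sample.encode("windows-1255"))
    print(convert_to_utf8(path))  # first call rewrites the file
    print(convert_to_utf8(path))  # second call sees UTF-8 and skips it
```

Because the header is rewritten along with the content, running the same function over a directory a second time should leave already-converted files untouched.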