Python - Decode UTF-16 file with BOM

Question

I have a UTF-16 LE file with a BOM. I'd like to convert this file to UTF-8 without a BOM so I can parse it using Python.

The code I usually use didn't do the trick: it returned unknown characters instead of the actual file contents.

f = open('dbo.chrRaces.Table.sql').read()
f = str(f).decode('utf-16le', errors='ignore').encode('utf8')
print f

What would be the proper way to decode this file so I can parse through it with f.readlines()?

Recommended answer

Firstly, you should read the file in binary mode; otherwise things will get confusing.

Then check for and remove the BOM, since it is part of the file but not part of the actual text.

import codecs

# Read in binary mode so the BOM bytes arrive untouched.
with open('dbo.chrRaces.Table.sql', 'rb') as f:
    encoded_text = f.read()

bom = codecs.BOM_UTF16_LE                       # see dir(codecs) for the other BOM constants
assert encoded_text.startswith(bom)             # make sure the encoding is what you expect, otherwise you'll get wrong data
encoded_text = encoded_text[len(bom):]          # strip away the BOM
decoded_text = encoded_text.decode('utf-16le')  # decode to unicode
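
If the byte order is not known in advance, the same check can be repeated for the other BOM constants in codecs. The following is only a sketch: the helper name decode_with_bom and the lookup table are made up for illustration and are not part of the original answer. One known limitation is that a UTF-16 LE file whose first character is U+0000 would be mistaken for UTF-32 LE.

```python
import codecs

# Longer BOMs are checked first, because BOM_UTF32_LE begins with the
# same two bytes as BOM_UTF16_LE.
KNOWN_BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF8, 'utf-8'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
]


def decode_with_bom(data):
    """Strip a recognised BOM from raw bytes and decode the remainder.

    Hypothetical helper: raises ValueError when no known BOM is present,
    rather than guessing an encoding.
    """
    for bom, encoding in KNOWN_BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding)
    raise ValueError('no known BOM found')


# Example input built in memory instead of read from a file.
sample = codecs.BOM_UTF16_LE + u'h\u00e9llo'.encode('utf-16-le')
text = decode_with_bom(sample)
```

Failing loudly when no BOM is found follows the same reasoning as the assert above: a wrong guess would produce wrong data.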

Don't encode (to UTF-8 or otherwise) until you're done with all parsing and processing. You should do all of that using unicode strings.
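
As a rough sketch of that "decode on input, encode on output" flow, the function below reads the file as text, works on the decoded lines, and writes UTF-8 only at the very end. It swaps in a different technique from the manual BOM check: the generic 'utf-16' codec, which reads the BOM to pick the byte order and leaves it out of the decoded text. The function name, the sample file, and the rstrip() step are placeholders for illustration only.

```python
import codecs
import io
import os
import tempfile


def convert_utf16_to_utf8(src_path, dst_path):
    """Rewrite a UTF-16 file (with BOM) as UTF-8 without a BOM."""
    # Decode once on input; the 'utf-16' codec consumes the BOM.
    with io.open(src_path, encoding='utf-16') as src:
        lines = src.readlines()

    # All parsing/processing happens on decoded text (placeholder step).
    lines = [line.rstrip() + u'\n' for line in lines]

    # Encode once on output; 'utf-8' (unlike 'utf-8-sig') writes no BOM.
    with io.open(dst_path, 'w', encoding='utf-8', newline='') as dst:
        dst.writelines(lines)


# Try it on a small sample file (hypothetical content).
work_dir = tempfile.mkdtemp()
src_path = os.path.join(work_dir, 'sample_utf16.sql')
dst_path = os.path.join(work_dir, 'sample_utf8.sql')
with open(src_path, 'wb') as f:
    f.write(codecs.BOM_UTF16_LE + u'-- caf\u00e9  \r\nGO\r\n'.encode('utf-16le'))

convert_utf16_to_utf8(src_path, dst_path)
with open(dst_path, 'rb') as f:
    utf8_bytes = f.read()
```

Because the file is opened in text mode, readlines() returns unicode strings directly, which is what the question was ultimately after.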

Also, errors='ignore' on decode may be a bad idea. Consider which is worse: having your program tell you something is wrong and stop, or having it return wrong data?
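
To make the trade-off concrete, here is a small sketch that decodes deliberately damaged input both ways. The helper decode_or_report and the truncated sample bytes are invented for this illustration.

```python
def decode_or_report(data, errors='strict'):
    """Decode UTF-16 LE bytes; report the problem and return None on failure."""
    try:
        return data.decode('utf-16le', errors)
    except UnicodeDecodeError as exc:
        print('decode failed: %s' % exc.reason)
        return None


# Hypothetical damaged input: the last byte of the final character is missing.
damaged = u'h\u00e9llo'.encode('utf-16le')[:-1]

strict_result = decode_or_report(damaged)             # fails loudly, returns None
lenient_result = decode_or_report(damaged, 'ignore')  # silently drops the damaged character
```

With the default strict handling the damage is reported; with 'ignore' the caller gets a shorter string and no hint that anything was lost.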
