Python - read text file with weird utf-16 format


Problem description

I'm trying to read a text file into Python, but it seems to use some very strange encoding. I tried the usual approach:

file = open('data.txt','r')

lines = file.readlines()

for line in lines[0:1]:
    print line,
    print line.split()

Output:

0.0200197   1.97691e-005

['0\x00.\x000\x002\x000\x000\x001\x009\x007\x00', '\x001\x00.\x009\x007\x006\x009\x001\x00e\x00-\x000\x000\x005\x00']

Printing the line works fine, but when I split the line so that I can convert the fields to floats, the resulting strings are full of '\x00' characters. Of course, when I try to convert those strings to floats, this produces an error. Any idea how I can convert these back into numbers?

I put the sample datafile here if you would like to try to load it: https://dl.dropboxusercontent.com/u/3816350/Posts/data.txt

I would like to simply use numpy.loadtxt or numpy.genfromtxt, but they also do not want to deal with this crazy file.

Solution

I'm willing to bet this is a UTF-16-LE file, and you're reading it as whatever your default encoding is.

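In UTF-16, each character takes two bytes.* If your characters are all ASCII, this means the UTF-16 encoding looks like the ASCII encoding with an extra '\x00' after each character. That is why printing the line looks normal (a terminal typically displays nothing for a NUL byte), while the list printed after split() shows every '\x00' explicitly.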
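
As a rough illustration of the point above, here is a small sketch written for Python 3 (the question itself uses Python 2). The sample values come from the question's output; the tab separator is an assumption, since the post does not show which whitespace the file uses.

```python
# Python 3 sketch: what ASCII text looks like once encoded as UTF-16-LE.
text = "0.0200197\t1.97691e-005"  # separator assumed to be a tab

raw = text.encode("utf-16-le")

# Every ASCII character becomes its ASCII byte followed by b"\x00".
print(raw[:4])
print(len(raw) == 2 * len(text))

# Splitting the raw bytes on whitespace gives fields like those in the
# question: the second field starts with the separator's trailing b"\x00".
fields = raw.split()
print(fields)

# Decoding first and splitting afterwards gives clean strings
# that float() accepts.
values = [float(tok) for tok in raw.decode("utf-16-le").split()]
print(values)
```

The stray zero bytes are not whitespace, so split() keeps them, and float() rejects the resulting fields.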

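To fix this, just decode the data. Decode the file's bytes as a whole rather than line by line: readlines() splits on the '\n' byte, which leaves that newline's trailing '\x00' at the start of the next line, so an individual line is not valid UTF-16-LE and line.decode('utf-16-le') typically fails with a "truncated data" UnicodeDecodeError.

data = open('data.txt', 'rb').read().decode('utf-16-le')
print data.split()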

Or do the same thing at the file level with the io or codecs module:

import io
file = io.open('data.txt', 'r', encoding='utf-16-le')
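
A minimal sketch of the file-level approach, written for Python 3, where the built-in open accepts the same encoding argument as io.open. So that it runs anywhere, it first writes a small UTF-16-LE stand-in for data.txt. The helper name read_utf16_floats and the second sample row are made up for illustration, and the layout (no byte order mark, whitespace-separated numbers) is an assumption based on the question's output.

```python
import io
import os
import tempfile


def read_utf16_floats(path, encoding="utf-16-le"):
    """Read whitespace-separated numbers from a UTF-16 text file.

    Hypothetical helper for illustration; returns one list of floats
    per non-empty line.
    """
    rows = []
    with io.open(path, "r", encoding=encoding) as handle:
        for line in handle:
            tokens = line.split()
            if tokens:
                rows.append([float(tok) for tok in tokens])
    return rows


# Stand-in for data.txt: the first row is from the question,
# the second row is invented sample data.
sample = "0.0200197\t1.97691e-005\n0.0400394\t2.5e-005\n"
sample_path = os.path.join(tempfile.mkdtemp(), "data.txt")
with io.open(sample_path, "w", encoding="utf-16-le", newline="") as out:
    out.write(sample)

rows = read_utf16_floats(sample_path)
print(rows)
```

If a file does start with a byte order mark, encoding='utf-16' lets Python detect the byte order and drop the mark. Recent NumPy versions also document an encoding argument for numpy.loadtxt and numpy.genfromtxt, which may make a helper like this unnecessary; check the documentation for your version.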


* This is a bit of an oversimplification: Each BMP character takes two bytes; each non-BMP character is turned into a surrogate pair, with each of the two surrogates taking two bytes. But you probably didn't care about these details.
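
For anyone who does care, the footnote can be checked with a short Python 3 sketch. The two characters are arbitrary examples chosen here, not taken from the data file.

```python
# Python 3 sketch: byte counts for BMP and non-BMP characters in UTF-16-LE.
bmp_char = "A"               # U+0041, inside the Basic Multilingual Plane
non_bmp_char = "\U0001F600"  # U+1F600, outside the BMP (arbitrary example)

bmp_bytes = bmp_char.encode("utf-16-le")
non_bmp_bytes = non_bmp_char.encode("utf-16-le")

print(len(bmp_bytes))      # one 16-bit code unit
print(len(non_bmp_bytes))  # two 16-bit code units (a surrogate pair)

# Derive the surrogate pair by hand and compare with the codec's output.
offset = ord(non_bmp_char) - 0x10000
high = 0xD800 + (offset >> 10)    # high (lead) surrogate
low = 0xDC00 + (offset & 0x3FF)   # low (trail) surrogate
manual = high.to_bytes(2, "little") + low.to_bytes(2, "little")
print(manual == non_bmp_bytes)
```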
