Python - read text file with weird utf-16 format

Views: 1057 | Published: 2017/8/17 0:41:47 | Tags: python numpy encoding utf-16le


Question

I'm trying to read a text file into Python, but it seems to use some very strange encoding. I tried the usual:
file = open('data.txt','r')

lines = file.readlines()

for line in lines[0:1]:
    print line,
    print line.split()
Output: 
0.0200197   1.97691e-005

['0\x00.\x000\x002\x000\x000\x001\x009\x007\x00', '\x001\x00.\x009\x007\x006\x009\x001\x00e\x00-\x000\x000\x005\x00']
Printing the line works fine, but when I split the line so that I can convert the pieces to floats, the result looks crazy. Of course, when I try to convert those strings to floats, this produces an error. Any idea how I can convert these back into numbers?

I put the sample datafile here if you would like to try to load it: 
https://dl.dropboxusercontent.com/u/3816350/Posts/data.txt

I would like to simply use numpy.loadtxt or numpy.genfromtxt, but they also do not want to deal with this crazy file.
Solution
I'm willing to bet this is a UTF-16-LE file, and you're reading it as whatever your default encoding is.

In UTF-16, each character takes two bytes.* If your characters are all ASCII, this means the UTF-16 encoding looks like the ASCII encoding with an extra '\x00' after each character.
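To make the "extra '\x00' after each character" point concrete, here is a small sketch. It is written for Python 3 (the snippets in this thread are Python 2), and the sample string is made up for illustration. It encodes ASCII text as UTF-16-LE, then misreads the bytes with a one-byte-per-character codec (latin-1 stands in for "whatever your default encoding is"), which reproduces the kind of tokens shown in the question.

```python
# Python 3 sketch; `text` is a made-up sample, not the asker's data.
text = "0.02 1.9e-005"

# Encode as UTF-16-LE: each ASCII character becomes its byte plus a zero byte.
raw = text.encode("utf-16-le")
print(raw)

# Misread the bytes one byte per character, as a wrong default codec would.
misread = raw.decode("latin-1")
tokens = misread.split()
print(tokens)

# The zero bytes make the tokens unusable as numbers.
try:
    float(tokens[0])
    conversion_failed = False
except ValueError:
    conversion_failed = True

# Decoding with the right codec gives clean, convertible strings.
clean = raw.decode("utf-16-le").split()
values = [float(token) for token in clean]
print(values)
```

Note that the second token starts with '\x00': the space itself is two bytes, and splitting on whitespace only removes the first of them.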

To fix this, just decode the data:
print line.decode('utf-16-le').split()
Or do the same thing at the file level with the io or codecs module:
import io
file = io.open('data.txt', 'r', encoding='utf-16-le')
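A fuller sketch of the file-level approach, assuming Python 3 and assuming the file holds whitespace-separated numbers as in the question. The helper name `read_utf16le_floats` and the temporary sample file are hypothetical, and the second data row is invented so the example has more than one line. Decoding at the file level also avoids a pitfall of decoding line by line: in UTF-16-LE a newline is the two bytes '\n\x00', so splitting raw bytes at '\n' leaves a stray '\x00' at the start of the next line.

```python
import io
import os
import tempfile


def read_utf16le_floats(path):
    """Read rows of whitespace-separated floats from a UTF-16-LE text file."""
    rows = []
    # io.open decodes the bytes while reading, so each line is clean text.
    with io.open(path, "r", encoding="utf-16-le") as handle:
        for line in handle:
            fields = line.split()
            if fields:  # skip blank lines
                rows.append([float(field) for field in fields])
    return rows


# Build a stand-in for data.txt (made-up contents, encoded as UTF-16-LE).
sample_text = "0.0200197   1.97691e-005\r\n0.0400394   2.5e-005\r\n"
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as tmp:
    tmp.write(sample_text.encode("utf-16-le"))
    sample_path = tmp.name

try:
    rows = read_utf16le_floats(sample_path)
finally:
    os.remove(sample_path)

print(rows)
```

If the real file starts with a byte order mark, the 'utf-16-le' codec keeps it as a leading '\ufeff' character, which would break the first float conversion; in that case encoding='utf-16' (which consumes the mark) is probably the better choice.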




* This is a bit of an oversimplification: Each BMP character takes two bytes; each non-BMP character is turned into a surrogate pair, with each of the two surrogates taking two bytes. But you probably didn't care about these details.
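For anyone who does care about those details, a short Python 3 sketch of the footnote. The two characters are arbitrary picks for illustration: one inside the BMP and one outside it.

```python
import struct

bmp_char = "\u00e9"         # U+00E9, inside the BMP
astral_char = "\U0001F600"  # U+1F600, outside the BMP

# A BMP character is a single 16-bit code unit: two bytes.
bmp_bytes = bmp_char.encode("utf-16-le")

# A non-BMP character becomes a surrogate pair: two code units, four bytes.
astral_bytes = astral_char.encode("utf-16-le")

# Unpack the four bytes as two little-endian unsigned 16-bit integers.
high, low = struct.unpack("<2H", astral_bytes)

print(len(bmp_bytes), len(astral_bytes))
print(hex(high), hex(low))
```

The first unit falls in the high-surrogate range 0xD800-0xDBFF and the second in the low-surrogate range 0xDC00-0xDFFF.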
