Python open CSV file with supposedly mixed encodings?


Problem description


I'm trying to read a CSV text file (UTF-8 without BOM according to Notepad++) using Python. However, there seems to be a problem with the encoding:

print(open(path, encoding="utf-8").read())

Codec can't decode byte 0x8f

This little character seems to be the problem (full string: "●• อีเปียขี้บ่น ت •●"), but I'm sure there will be more.

If I try UTF-16, then there is a message:

#also tried with encode
print(open(path, encoding="utf-16").read().encode('utf-8'))

Illegal UTF-16 surrogate

Even when I try opening it with an automatic codec finder, I receive an error.

def csv_unireader(f, encoding="utf-8"):
    for row in csv.reader(codecs.iterencode(codecs.iterdecode(f, encoding), "utf-8")):
        yield [e.decode("utf-8") for e in row]

What am I overlooking? The file contains Twitter texts, which certainly contain a lot of different characters. But surely just reading and printing a file can't be such a difficult task in Python?

Edit:

Just tried using the code from this answer: http://stackoverflow.com/a/14786752/45311

import csv

with open('source.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)

This at least prints some rows to the screen, but it also throws an error after some rows:

cp850.py, line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 62-63: character maps to <undefined>

It seems to automatically use CP850, which is yet another encoding... I can't make sense of all this.

Solution

Which version of Python are you using? If it is 2.x, try adding this import at the beginning of your script:

from __future__ import unicode_literals

Then try:

print(open(path).read().encode('utf-8'))
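On Python 3 the situation looks a little different. The traceback in the question ends in the encode function of cp850.py, which suggests that the file was decoded fine and the failure happens when print() encodes the text for a CP850 console. Below is a minimal sketch under that assumption. The helper names console_safe and print_csv_rows are made up for illustration; they replace characters the console cannot show with "?" instead of raising.

```python
import csv
import sys


def console_safe(text, encoding=None):
    """Return text with characters the target encoding cannot represent
    replaced by '?'. Defaults to the encoding of sys.stdout."""
    encoding = encoding or sys.stdout.encoding or "utf-8"
    # Round-trip through the console encoding; errors="replace" avoids
    # UnicodeEncodeError for characters missing from that code page.
    return text.encode(encoding, errors="replace").decode(encoding)


def print_csv_rows(path, file_encoding="utf-8"):
    """Read a CSV file as text and print each row in a console-safe form."""
    with open(path, newline="", encoding=file_encoding) as f:
        for row in csv.reader(f):
            print([console_safe(cell) for cell in row])
```

The data read from the file stays intact; only the printed copy is lossy. Writing the rows to a UTF-8 file instead of the console would avoid the replacement entirely.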

There is also a great tool for charset detection: chardet. I hope this helps.
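chardet is a third-party package and guesses the encoding statistically. As a rough standard-library stand-in (explicitly not chardet's algorithm), one can try a short list of candidate encodings in order and keep the first that decodes cleanly. The function name decode_with_fallback and the default candidate list are assumptions chosen for illustration.

```python
def decode_with_fallback(raw, candidates=("utf-8", "cp1252")):
    """Decode raw bytes with the first candidate encoding that succeeds.

    Returns a (text, encoding_name) tuple. If no candidate works, the bytes
    are decoded as UTF-8 with undecodable bytes replaced by U+FFFD.
    """
    for name in candidates:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue  # this candidate does not fit, try the next one
    return raw.decode("utf-8", errors="replace"), "utf-8 (with replacement)"
```

Order matters: strict encodings such as UTF-8 should come first, because permissive single-byte encodings accept almost any input and would mask the real one. A successful decode only shows that the bytes are valid in that encoding, not that the guess is correct.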
