Read a file in Python having rogue byte 0xc0 that causes utf-8 and ascii to error out

Question

Trying to read a tab-separated file into pandas dataframe:

>>> df = pd.read_table(fn, na_filter=False, error_bad_lines=False)

It errors out like this:

b'Skipping line 58: expected 11 fields, saw 12\n'
Traceback (most recent call last):
...(many lines)...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 115: invalid start byte

It seems the byte 0xc0 causes pain at both utf-8 and ascii encodings.

>>> df = pd.read_table(fn, na_filter=False, error_bad_lines=False, encoding='ascii')
...(many lines)...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc0 in position 115: ordinal not in range(128)
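Both errors can be reproduced without the original file. The following is a minimal sketch, assuming a made-up byte string `sample` and a hypothetical helper `try_decode` (neither is from the original post). It relies on two facts: 0xc0 is never a valid byte in UTF-8, and ASCII only covers 0x00 through 0x7f.

```python
# Hypothetical sample data; only the single 0xc0 byte matters here.
sample = b"abc\xc0def"


def try_decode(data, encoding):
    """Return the decoded text, or the error message if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return str(exc)


utf8_result = try_decode(sample, "utf-8")
ascii_result = try_decode(sample, "ascii")

# Both calls fail on the same byte, for different reasons.
print(utf8_result)
print(ascii_result)
```

This suggests the failure comes from the decoding step itself rather than from pandas or the csv module, which would explain why both readers hit the same problem.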

I ran into the same issues with csv module's reader too.
If I import the file into OpenOffice Calc, it gets imported properly, the columns are properly recognized etc. Probably the offending 0xc0 byte is ignored there. This is not some vital piece of the data etc, it's probably just a fluke write error by the system that generated this file. I'll be happy to even zap the line where this occurs if it comes to that. I just want to read the file into the Python program. The error_bad_lines=False option of pandas ought to have taken care of this problem but no dice. Also, the file does NOT have any content in non-English scripts that makes unicode so necessary. It's all standard English letters and numbers. I tried utf-16, utf-32 etc too but they only caused more errors of their own.

How to make Python (pandas DataFrame in particular) read a file having one or more rogue 0xc0 bytes?

Recommended answer

Moving this answer here from another place where it got a hostile reception.

Found one encoding that actually accepts (meaning, doesn't error out on) byte 0xc0:

encoding="ISO-8859-1"  

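Note: This entails making sure the rest of the file doesn't have non-ASCII characters, because ISO-8859-1 decodes every byte as one character and any multi-byte UTF-8 sequence would come out as the wrong characters. This may be helpful for folks like me who didn't have any such characters in their file anyways and just wanted Python to load the damn thing while both utf-8 and ascii encodings were erroring out.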

More on ISO-8859-1 : What is the difference between UTF-8 and ISO-8859-1?
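A short sketch of why ISO-8859-1 does not error out, using a made-up byte string `sample` rather than the real file: the encoding maps each of the 256 possible byte values to the code point with the same number, so decoding cannot fail. Under it, 0xc0 becomes the character U+00C0 ("À").

```python
# Hypothetical sample data standing in for one field of the file.
sample = b"abc\xc0def"

# 0xc0 is not dropped; it is decoded as U+00C0 (LATIN CAPITAL LETTER A WITH GRAVE).
text = sample.decode("ISO-8859-1")

# Every possible byte value decodes, and the result round-trips unchanged.
all_bytes = bytes(range(256))
decoded_all = all_bytes.decode("ISO-8859-1")
roundtrip_ok = decoded_all.encode("ISO-8859-1") == all_bytes

print(repr(text))
print(roundtrip_ok)
```

The trade-off is that this never reports a problem: a file that really is UTF-8 with non-ASCII text would load without error but with garbled characters.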

The new command that works:

>>> df = pd.read_table(fn, na_filter=False, error_bad_lines=False, encoding='ISO-8859-1')

After reading it in, the dataframe is fine, the columns, data are all working like they did in OpenOffice Calc. I still have no idea where the offending 0xc0 byte went but it doesn't matter as I've got the data I needed.
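For anyone who does want to know where the byte sits, or who prefers to keep UTF-8 and only neutralize the bad byte, here is a rough sketch using the standard library. The helper `find_byte` and the in-memory `raw` data are hypothetical stand-ins for the real file; `errors="replace"` is the standard codec error handler that substitutes U+FFFD for undecodable bytes.

```python
import csv
import io


def find_byte(raw, byte=0xC0):
    """Return (line_number, column) pairs for each occurrence of `byte`.

    Line numbers are 1-based, columns are 0-based byte offsets in the line.
    """
    hits = []
    for lineno, line in enumerate(raw.splitlines(), start=1):
        for col, value in enumerate(line):
            if value == byte:
                hits.append((lineno, col))
    return hits


# Hypothetical tab-separated content with one rogue byte on the third line.
raw = b"id\tname\n1\tfoo\n2\tb\xc0r\n"

hits = find_byte(raw)

# Decode as UTF-8, replacing undecodable bytes with U+FFFD instead of raising.
text = raw.decode("utf-8", errors="replace")
rows = list(csv.reader(io.StringIO(text, newline=""), delimiter="\t"))

print(hits)
print(rows)
```

With a real file, the same handler can be passed as `open(fn, encoding="utf-8", errors="replace", newline="")`. Newer pandas releases also accept an `encoding_errors` argument in `read_csv` that serves the same purpose, though whether it is available depends on the installed version.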
