Pandas dataframe read_csv on bad data


Problem description


I want to read in a very large csv (it cannot be opened in Excel and edited easily), but somewhere around the 100,000th row there is a row with one extra column, causing the program to crash. This row is erroneous, so I need a way to ignore the fact that it has an extra column. There are around 50 columns, so hardcoding the headers and using names or usecols isn't preferable. I'll also possibly encounter this issue in other csv's and want a generic solution. I couldn't find anything in read_csv, unfortunately. The code is as simple as this:

import pandas as pd

def loadCSV(filePath):
    dataframe = pd.read_csv(filePath, index_col=False, encoding='iso-8859-1', nrows=1000)
    datakeys = dataframe.keys()
    return dataframe, datakeys

Answer

Pass error_bad_lines=False to skip erroneous rows:


error_bad_lines : boolean, default True. Lines with too many fields (e.g. a csv line with too many commas) will by default cause an exception to be raised, and no DataFrame will be returned. If False, then these "bad lines" will be dropped from the DataFrame that is returned. (Only valid with C parser)
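Note that error_bad_lines was deprecated in pandas 1.3 and removed in pandas 2.0; the replacement is the on_bad_lines parameter, where on_bad_lines='skip' gives the same behavior. A minimal sketch with a small in-memory CSV (the data here is illustrative, not from the question):

```python
import io
import pandas as pd

# Sample data: the second data row has an extra (fourth) field,
# mimicking the malformed row described in the question.
csv_data = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n"

# pandas >= 1.3: on_bad_lines='skip' drops malformed rows.
# On older versions, use error_bad_lines=False instead.
df = pd.read_csv(io.StringIO(csv_data), on_bad_lines='skip')

print(df)  # the row "4,5,6,7" is dropped; 2 rows x 3 columns remain
```

on_bad_lines also accepts 'warn' (skip but emit a warning) and 'error' (the default, raise), which is useful when you want to log which rows were discarded rather than drop them silently.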
