Skip rows with missing values in read_csv

Views: 95 Published: 2020/5/24 0:41:07 Tags: python, pandas


Question

I have a very large CSV file that I need to read in. To make this fast and save RAM, I am using read_csv and setting the dtype of some columns to np.uint32. The problem is that some rows have missing values, and pandas uses a float to represent those.

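Is it possible to simply skip rows with missing values? I know I could do this after reading in the whole file, but that means I could not set the dtype until then, and so would use too much RAM.
Is it possible to convert missing values to some other value that I choose while the data is being read?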

Recommended answer

It would be nice if you could fill NaN with, say, 0 during the read itself. Perhaps a feature request on the pandas GitHub is in order...

However, for the time being, you can define your own function to do that and pass it to the converters argument of read_csv:

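import pandas as pd

def conv(val):
    # Converters receive the raw field text, so a missing value arrives as an
    # empty string; a comparison against np.nan would never be true.
    if val.strip() == "":
        return 0  # or whatever else you want to represent your NaN with
    return int(val)

# The dtype mapping for the other columns is left elided here.
df = pd.read_csv(file, converters={colWithNaN: conv}, dtype=...)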

Note that converters takes a dict, so you need to specify a converter for each column whose NaN values must be dealt with. This can get a little tiresome if many columns are affected. You can use either column names or column numbers as keys.

Also note that this might slow down read_csv, depending on how the converter function is handled. Further, if only one column needs its NaN values handled during the read, you can skip a proper function definition and use a lambda function instead:

df = pd.read_csv(file, converters={colWithNaN: lambda x: int(x) if x.strip() else 0}, dtype=...)
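To make the converter approach concrete, here is a minimal, self-contained sketch. The sample data, the column names id and count, and the helper name empty_to_zero are all hypothetical and chosen only for illustration. The sketch assumes that an empty field reaches the converter as an empty string, and it casts the converted column explicitly afterwards rather than relying on whatever dtype pandas infers for converter output.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample data: the "count" column has one empty field.
csv_text = "id,count\n1,10\n2,\n3,30\n"


def empty_to_zero(val):
    # The converter is called with the raw field text of each cell.
    # An empty (or whitespace-only) field is mapped to 0.
    return int(val) if val.strip() else 0


df = pd.read_csv(
    io.StringIO(csv_text),
    converters={"count": empty_to_zero},  # handles the column with gaps
    dtype={"id": np.uint32},              # columns without gaps can use dtype
)

# Cast the converted column explicitly to the compact integer type.
df["count"] = df["count"].astype(np.uint32)

print(df)
print(df.dtypes)
```

A converter and a dtype should not both be given for the same column, which is why the dtype mapping here covers only the column without gaps.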

Reading in chunks

You could also read the file in small chunks and stitch them together to get your final output. You can do a bunch of things this way. Here is an illustrative example:

chunks = []
reader = pd.read_csv(file, chunksize=1000)
for chunk in reader:
    chunk.dropna(axis=0, inplace=True)  # drop all rows with any NaN value
    chunk[colToConvert] = chunk[colToConvert].astype(np.uint32)
    chunks.append(chunk)
# DataFrame.append was removed in pandas 2.0, so combine the chunks with concat
result = pd.concat(chunks, ignore_index=True)
del reader, chunks, chunk

Note that this method never holds the whole unconverted file in memory. The cleaned chunks do exist twice for a moment while pd.concat builds result, but by then they contain only the rows that were kept, with the converted columns already stored as np.uint32, which is a fair bargain. This method may also work out to be faster than using a converter function.
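The chunked approach can be tried end to end with a small sketch like the one below. The in-memory sample data, the column names id and count, and the tiny chunksize of 2 are assumptions made purely for illustration; a real file would use a much larger chunk size.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample data: rows 2 and 4 are missing their "count" value.
csv_text = "id,count\n1,10\n2,\n3,30\n4,\n5,50\n"

pieces = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # Drop incomplete rows first, then downcast the now gap-free column.
    cleaned = chunk.dropna(axis=0).astype({"count": np.uint32})
    pieces.append(cleaned)

# Stitch the cleaned chunks together into the final frame.
result = pd.concat(pieces, ignore_index=True)

print(result)
print(result.dtypes)
```

Only the rows with a value in every column survive, and the count column ends up as uint32 without the full file ever being held in its float-typed form.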
