Export dask groups to csv

Problem description

I have a single large file. It has 40,955,924 lines and is larger than 13GB. I need to split this file into individual files based on a single field. If I were using a pd.DataFrame, I would use this:

for k, v in df.groupby(['id']):
    v.to_csv(k, sep='\t', header=True, index=False)

However, with dask I get the error KeyError: 'Column not found: 0'. There is a solution to this specific error in "Iterate over GroupBy object in dask", but it requires using pandas to store a copy of the dataframe, which I cannot do. Any help on splitting this file up would be greatly appreciated.

Recommended answer

You want to use apply() for this:

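def do_to_csv(df):
    # df.name holds the key of the current group; it doubles as the output file name
    df.to_csv(df.name, sep='\t', header=True, index=False)
    # return the group so that dask has a result to compute
    return df

# nothing is written until compute() is called
df.groupby(['id']).apply(do_to_csv, meta=df._meta).size.compute()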

Note:
- The group key is stored in the dataframe's name attribute.
- We return the dataframe and supply a meta; this is not strictly necessary, but you need to compute on something, and it is convenient to know exactly what that thing is.
- The final output is the size of the data written. Note that for a dataframe, size counts elements (rows times columns) rather than rows.
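If bringing in dask is not an option, the same split can be sketched with the standard library alone by streaming the file one row at a time. This is a different technique from the apply() approach above, not the answer's method. The function name split_by_field, the .tsv suffix, and the demo data are hypothetical, and the sketch assumes that the key values are safe to use as file names and that the output directory starts out empty (files are opened in append mode).

```python
import csv
import os
import tempfile


def split_by_field(src_path, out_dir, field, sep="\t"):
    """Stream src_path and append each row to <out_dir>/<key>.tsv.

    Returns a dict mapping each key to the number of data rows written.
    Assumes key values are valid file names and out_dir starts empty.
    """
    counts = {}
    with open(src_path, newline="") as src:
        reader = csv.reader(src, delimiter=sep)
        header = next(reader)
        key_index = header.index(field)
        for row in reader:
            if not row:  # skip blank lines
                continue
            key = row[key_index]
            out_path = os.path.join(out_dir, f"{key}.tsv")
            # Reopening per row keeps the number of open files at one;
            # caching handles would be faster but can hit the OS limit
            # when there are many distinct keys.
            with open(out_path, "a", newline="") as out:
                writer = csv.writer(out, delimiter=sep, lineterminator="\n")
                if key not in counts:
                    writer.writerow(header)  # first row for this key
                writer.writerow(row)
            counts[key] = counts.get(key, 0) + 1
    return counts


# Tiny demo on a throwaway file (the data is made up for illustration).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "input.tsv")
    with open(src, "w", newline="") as f:
        f.write("id\tvalue\na\t1\nb\t2\na\t3\n")
    out_dir = os.path.join(tmp, "out")
    os.makedirs(out_dir)
    demo_counts = split_by_field(src, out_dir, "id")
    with open(os.path.join(out_dir, "a.tsv"), newline="") as f:
        demo_a = f.read()
```

Only one row is held in memory at a time, so the 13GB input never has to fit in RAM. The price is one file open per row, which is slow for tens of millions of lines; sorting the input by the key first, or keeping a bounded cache of open handles, are possible refinements.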
