Export dask groups to csv

Views: 283 Published: 2020/5/24 1:24:59 Tags: python pandas pandas-groupby dask

Problem description

I have a single large file. It has 40,955,924 lines and is larger than 13 GB. I need to split it into individual files based on a single field. If I were using a pd.DataFrame, I would use this:

# Write each group to its own tab-separated file, named after the group key
for k, v in df.groupby('id'):
    v.to_csv(str(k), sep='\t', header=True, index=False)

However, when I run this on a dask dataframe I get the error KeyError: 'Column not found: 0'. There is a solution to this specific error in the question "Iterate over GroupBy object in dask", but it requires using pandas to store a copy of the dataframe, which I cannot do. Any help with splitting this file up would be greatly appreciated.

Recommended answer

You want to use apply() for this:

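def do_to_csv(df):
    # df.name holds the key of the group being processed; str() covers non-string ids
    df.to_csv(str(df.name), sep='\t', header=True, index=False)
    return df

# Nothing is written until compute() runs; .size is just a cheap thing to compute
df.groupby('id').apply(do_to_csv, meta=df._meta).size.compute()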

Note:
- The group key is stored in the dataframe's name attribute.
- We return the dataframe and supply a meta. This is not strictly necessary, but you need to compute on something, and it is convenient to know exactly what that thing is.
- The final output is the size of the data written. Note that .size counts elements (rows times columns) rather than rows.
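For comparison, here is a minimal sketch of a different technique that is not part of the answer above: streaming the file once with Python's standard csv module and appending each row to a per-key output file, so that the whole table never sits in memory. The function name split_by_field, the .tsv suffix, and the three-row demo file are hypothetical choices made for illustration. The sketch assumes a tab-separated input with a header row containing an id column.

```python
import csv
import os
import tempfile


def split_by_field(src_path, out_dir, field="id", delimiter="\t"):
    """Stream src_path once and write each row to a file named after row[field].

    Returns a dict mapping each key to the number of rows written for it.
    """
    os.makedirs(out_dir, exist_ok=True)
    handles = {}  # key -> open output file
    writers = {}  # key -> csv.DictWriter bound to that file
    counts = {}   # key -> rows written so far
    with open(src_path, newline="") as src:
        reader = csv.DictReader(src, delimiter=delimiter)
        try:
            for row in reader:
                key = row[field]
                if key not in writers:
                    # First time this key appears: open its file and write the header.
                    fh = open(os.path.join(out_dir, f"{key}.tsv"), "w", newline="")
                    handles[key] = fh
                    writers[key] = csv.DictWriter(
                        fh, fieldnames=reader.fieldnames, delimiter=delimiter
                    )
                    writers[key].writeheader()
                    counts[key] = 0
                writers[key].writerow(row)
                counts[key] += 1
        finally:
            # Close every output file, even if a row raised an error.
            for fh in handles.values():
                fh.close()
    return counts


# Tiny demonstration on a hypothetical three-row file.
demo_dir = tempfile.mkdtemp()
demo_src = os.path.join(demo_dir, "input.tsv")
with open(demo_src, "w", newline="") as f:
    f.write("id\tvalue\na\t1\nb\t2\na\t3\n")

demo_counts = split_by_field(demo_src, os.path.join(demo_dir, "out"))
```

This sketch keeps one file handle open per distinct key, so a file with very many distinct ids could hit the operating system's open-file limit. It also uses the key directly as a file name, which is only safe when ids contain no path separators.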
