Export dask groups to csv
Question
I have a single large file. It has 40,955,924 lines and is larger than 13 GB. I need to split this file into individual files based on a single field. If I were using a pd.DataFrame, I would use this:
for k, v in df.groupby(['id']):
    v.to_csv(k, sep='\t', header=True, index=False)
However, when I run this on a dask dataframe, I get the error KeyError: 'Column not found: 0'.
There is a solution to this specific error in "Iterate over GroupBy object in dask", but it requires using pandas to store a copy of the dataframe, which I cannot do. Any help with splitting this file up would be greatly appreciated.
Recommended answer
You want to use apply() for this:
def do_to_csv(df):
    # df.name holds the key of the group being processed
    df.to_csv(df.name, sep='\t', header=True, index=False)
    return df

df.groupby(['id']).apply(do_to_csv, meta=df._meta).size.compute()
Note
- The group key is stored in the dataframe's name attribute.
- We return the dataframe and supply a meta; this is not strictly necessary, but you need to compute on something, and it is convenient to know exactly what that thing is.
- The final output is the size of the data written. Note that .size on a dataframe counts elements (rows times columns), so use len() on the result if you want the number of rows.
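If dask is not a hard requirement, the same split can be done in a single streaming pass with the standard library. This is a different technique from the answer above: a minimal sketch that never holds more than one row in memory. The function name split_by_field and the ".tsv" output naming are hypothetical, and the sketch assumes a tab-delimited input with a header row, id values that are valid file names, and few enough distinct ids to stay under the operating system's limit on open files.

```python
import csv
import os
from contextlib import ExitStack


def split_by_field(src_path, out_dir, field="id", delimiter="\t"):
    """Stream src_path once, writing one file per distinct value of `field`.

    Hypothetical helper (not part of dask or pandas). Returns a dict that
    maps each field value to the number of data rows written for it.
    """
    os.makedirs(out_dir, exist_ok=True)
    writers = {}  # field value -> csv.DictWriter for that group's file
    counts = {}   # field value -> rows written so far
    with ExitStack() as stack:  # closes every opened file on exit
        src = stack.enter_context(open(src_path, newline=""))
        reader = csv.DictReader(src, delimiter=delimiter)
        for row in reader:
            key = row[field]
            if key not in writers:
                # First time this key is seen: open its file, write header.
                # Assumption: `key` is safe to use as a file name.
                out_path = os.path.join(out_dir, f"{key}.tsv")
                out = stack.enter_context(open(out_path, "w", newline=""))
                writer = csv.DictWriter(
                    out,
                    fieldnames=reader.fieldnames,
                    delimiter=delimiter,
                    lineterminator="\n",
                )
                writer.writeheader()
                writers[key] = writer
                counts[key] = 0
            writers[key].writerow(row)
            counts[key] += 1
    return counts
```

Calling split_by_field("big_file.tsv", "out") (both paths are placeholders) would create one file per id under out/ and return the per-id row counts. Keeping every output file open avoids reopening files for each row, at the cost of one file handle per distinct id.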