Basic groupby operations in Dask

Question

I am attempting to use Dask to handle a large file (50 GB). Typically, I would load it into memory and use Pandas. I want to group by two columns, "A" and "B", and whenever column "C" has a value, I want to repeat that value down the column within that particular group (a per-group forward fill).

In pandas, I would do the following:

df['C'] = df.groupby(['A','B'])['C'].fillna(method = 'ffill')
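To make the intended result concrete, here is a small pandas sketch on made-up data (the values below are hypothetical and not from the original file). It uses `GroupBy.ffill()`, which should behave the same as `fillna(method='ffill')` for this case:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: two (A, B) groups with gaps in column C.
df = pd.DataFrame({
    "A": [1, 1, 1, 2, 2],
    "B": [7, 7, 7, 8, 8],
    "C": [10.0, np.nan, np.nan, np.nan, 20.0],
})

# Forward-fill C within each (A, B) group.
# A gap that comes before a group's first value stays NaN,
# so the 10.0 from group (1, 7) does not leak into group (2, 8).
df["C"] = df.groupby(["A", "B"])["C"].ffill()

print(df)
```

The fourth row stays empty because nothing precedes it inside its own group, which is the difference between a per-group fill and a plain `df['C'].ffill()`.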

What would be the equivalent in Dask? Also, I am a little lost as to how to structure problems in Dask as opposed to in Pandas.

Thanks.

My progress so far:

First, set the index:

df1 = df.set_index(['A','B'])

Then group by:

df1.groupby(['A','B']).apply(lambda x: x.fillna(method='ffill')).compute()

Recommended answer

It appears dask does not currently implement the fillna method for GroupBy objects. I tried submitting a PR for it some time ago and gave up quite quickly.

Also, dask doesn't support the method parameter (as it isn't always trivial to implement with delayed algorithms).

A workaround for this could be using fillna before grouping, like so:

df['C'] = df.fillna(0).groupby(['A','B'])['C']

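This is untested, though. Note that it fills missing values with a constant instead of forward-filling, and the right-hand side is still a GroupBy object, so it would need a further step before it can be assigned to a column.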

You can find my (failed) attempt here: https://github.com/nirizr/dask/tree/groupy_fillna
