Basic groupby operations in Dask
Question
I am trying to use Dask to process a large file (50 GB). Normally I would load it into memory and use Pandas. I want to group by two columns, "A" and "B", and whenever column "C" starts with a value, I want to repeat that value in that column for that particular group.
In pandas, I would do the following:
df['C'] = df.groupby(['A','B'])['C'].fillna(method = 'ffill')
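To make the intended behaviour concrete, here is a small pandas sketch on made-up data (the values are purely illustrative). It uses `GroupBy.ffill()`, since recent pandas releases deprecate `fillna(method=...)`; the result should match the line above.

```python
import numpy as np
import pandas as pd

# Toy data: two (A, B) groups with gaps in column C.
df = pd.DataFrame({
    "A": ["x", "x", "y", "y", "x"],
    "B": [1, 1, 2, 2, 1],
    "C": [10.0, np.nan, np.nan, 20.0, np.nan],
})

# Forward-fill C within each (A, B) group; values never cross group boundaries.
df["C"] = df.groupby(["A", "B"])["C"].ffill()
print(df)
```

The leading NaN in the ("y", 2) group stays NaN, because there is no earlier value in that group to carry forward.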
What would be the equivalent in Dask? I am also a little lost as to how to structure problems in Dask as opposed to Pandas.
Thanks.
My progress so far:
First, set the index:
df1 = df.set_index(['A','B'])
Then group by:
df1.groupby(['A','B']).apply(lambda x: x.fillna(method='ffill')).compute()
Recommended answer
It appears dask does not currently implement the fillna method for GroupBy objects. I tried submitting a PR for it some time ago and gave up quite quickly.
Also, dask doesn't support the method parameter (as it isn't always trivial to implement with delayed algorithms).
A workaround for this could be using fillna before grouping, like so:
df['C'] = df.fillna(0).groupby(['A','B'])['C']
This has not been tested, though.
You can find my (failed) attempt here: https://github.com/nirizr/dask/tree/groupy_fillna