Dask apply with custom function

Problem description

I am experimenting with Dask, but I encountered a problem while using apply after grouping.

I have a Dask DataFrame with a large number of rows. Let's consider, for example, the following:

import numpy as np
import pandas as pd
import dask.dataframe as dd
N = 10000
df = pd.DataFrame({'col_1': np.random.random(N), 'col_2': np.random.random(N)})
ddf = dd.from_pandas(df, npartitions=8)

I want to bin the values of col_1 and I follow the solution from here

bins = np.linspace(0,1,11)
labels = list(range(len(bins)-1))
ddf2 = ddf.map_partitions(test_f, 'col_1',bins,labels)

where

def test_f(df,col,bins,labels):
    return df.assign(bin_num = pd.cut(df[col],bins,labels=labels))

This works as I expected.

Now I want to take the median value in each bin (taken from here)

median = ddf2.groupby('bin_num')['col_1'].apply(pd.Series.median).compute()

Having 10 bins, I expect median to have 10 rows, but it actually has 80. The dataframe has 8 partitions so I guess that somehow the apply is working on each one individually.

However, if I want the mean and use mean

median = ddf2.groupby('bin_num')['col_1'].mean().compute()

it works and the output has 10 rows.

The question is then: what am I doing wrong that is preventing apply from operating as mean?

Recommended answer

Maybe this warning is the key (Dask doc: SeriesGroupBy.apply):

Pandas’ groupby-apply can be used to apply arbitrary functions, including aggregations that result in one row per group. Dask’s groupby-apply will apply func once to each partition-group pair, so when func is a reduction you’ll end up with one row per partition-group pair. To apply a custom aggregation with Dask, use dask.dataframe.groupby.Aggregation.
