Dask apply with custom function
Problem description
I am experimenting with Dask, but I ran into a problem when using apply after grouping.
I have a Dask DataFrame with a large number of rows. Consider, for example, the following:
import numpy as np
import pandas as pd
import dask.dataframe as dd
N = 10000
df = pd.DataFrame({'col_1': np.random.random(N), 'col_2': np.random.random(N)})
ddf = dd.from_pandas(df, npartitions=8)
I want to bin the values of col_1, and I follow the solution from here:
bins = np.linspace(0,1,11)
labels = list(range(len(bins)-1))
ddf2 = ddf.map_partitions(test_f, 'col_1',bins,labels)
where
def test_f(df, col, bins, labels):
    # Add a bin_num column holding the bin label of each value in df[col]
    return df.assign(bin_num=pd.cut(df[col], bins, labels=labels))
This works as I expected.
Now I want to take the median value in each bin (taken from here):
median = ddf2.groupby('bin_num')['col_1'].apply(pd.Series.median).compute()
Having 10 bins, I expect median to have 10 rows, but it actually has 80. The dataframe has 8 partitions, so I guess that somehow the apply is working on each one individually.
However, if I want the mean and use mean:
median = ddf2.groupby('bin_num')['col_1'].mean().compute()
This works, and the output has 10 rows.
The question is then: what am I doing wrong that prevents apply from operating as mean does?
Recommended answer
Maybe this warning is the key (Dask doc: SeriesGroupBy.apply):
Pandas’ groupby-apply can be used to apply arbitrary functions, including aggregations that result in one row per group. Dask’s groupby-apply will apply func once to each partition-group pair, so when func is a reduction you’ll end up with one row per partition-group pair. To apply a custom aggregation with Dask, use dask.dataframe.groupby.Aggregation.
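To make the warning concrete, here is a minimal plain-Python sketch of the idea. It is not Dask code and does not reproduce Dask's internals. The names partitions, bin_of, chunk, agg and finalize are hypothetical stand-ins chosen for illustration. The sketch assumes 8 partitions and 10 equal-width bins as in the question, and it contrasts reducing inside each partition with combining the partial results before reducing.

```python
import random
import statistics
from collections import defaultdict

random.seed(0)
N_PARTITIONS = 8
N_BINS = 10

# Hypothetical stand-in for the 8 partitions of the Dask DataFrame.
values = [random.random() for _ in range(10_000)]
partitions = [values[i::N_PARTITIONS] for i in range(N_PARTITIONS)]


def bin_of(x):
    """Map x in [0, 1) to a bin number 0..9 (10 equal-width bins)."""
    return min(int(x * N_BINS), N_BINS - 1)


def chunk(partition):
    """Stage 1, run once per partition: collect the values of each bin."""
    groups = defaultdict(list)
    for x in partition:
        groups[bin_of(x)].append(x)
    return groups


def agg(chunk_results):
    """Stage 2: merge the per-partition results so each bin appears once."""
    merged = defaultdict(list)
    for groups in chunk_results:
        for b, xs in groups.items():
            merged[b].extend(xs)
    return merged


def finalize(merged):
    """Stage 3: reduce the complete set of values of each bin to its median."""
    return {b: statistics.median(xs) for b, xs in sorted(merged.items())}


chunk_results = [chunk(p) for p in partitions]

# Reducing inside each partition gives one row per partition-bin pair.
per_partition_rows = [
    (i, b, statistics.median(xs))
    for i, groups in enumerate(chunk_results)
    for b, xs in groups.items()
]

# Combining the partial results first gives one row per bin.
medians = finalize(agg(chunk_results))

print(len(per_partition_rows))  # 80 rows: 8 partitions x 10 bins
print(len(medians))             # 10 rows: one per bin
```

The three functions correspond in spirit to the chunk, agg and finalize arguments of dask.dataframe.groupby.Aggregation, although in Dask they receive grouped pandas objects rather than lists. A mean can be combined from small per-partition pieces (a sum and a count), whereas an exact median needs all values of a group in one place, which is why this sketch carries the full lists through the first two stages.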