Is there a way to get the nlargest items per group in dask?
Question
I have the following dataset:
location  category  percent
A         5         100.0
B         3         100.0
C         2         50.0
          4         13.0
D         2         75.0
          3         59.0
          4         13.0
          5         4.0
I'm trying to get the nlargest percent values in a dataframe grouped by location. For example, if I want the top 2 largest percentages for each group, the output should be:
location  category  percent
A         5         100.0
B         3         100.0
C         2         50.0
          4         13.0
D         2         75.0
          3         59.0
It looks like in pandas this is relatively straightforward using pandas.core.groupby.SeriesGroupBy.nlargest, but dask doesn't have an nlargest function for groupby. I have been playing around with apply but can't seem to get it to work properly.
df.groupby(['location']).apply(lambda x: x['percent'].nlargest(2)).compute()
But I only get the error ValueError: Wrong number of items passed 0, placement implies 8
Recommended answer
The apply should work, but your syntax is a little off:
In [11]: df
Out[11]:
Dask DataFrame Structure:
              Unnamed: 0 location category  percent
npartitions=1
                   int64   object    int64  float64
                     ...      ...      ...      ...
Dask Name: from-delayed, 3 tasks

In [12]: df.groupby("location")["percent"].apply(lambda x: x.nlargest(2), meta=('x', 'f8')).compute()
Out[12]:
location
A         0    100.0
B         1    100.0
C         2     50.0
          3     13.0
D         4     75.0
          5     59.0
Name: x, dtype: float64
In pandas you'd have .nlargest and .rank as groupby methods, which would let you do this without the apply:
In [21]: df1
Out[21]:
  location  category  percent
0        A         5    100.0
1        B         3    100.0
2        C         2     50.0
3        C         4     13.0
4        D         2     75.0
5        D         3     59.0
6        D         4     13.0
7        D         5      4.0

In [22]: df1.groupby("location")["percent"].nlargest(2)
Out[22]:
location
A         0    100.0
B         1    100.0
C         2     50.0
          3     13.0
D         4     75.0
          5     59.0
Name: percent, dtype: float64
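The .rank route is mentioned but not shown, so here is a rough pandas-only sketch of it. The names `ranks` and `top2` are illustrative, and `method="first"` is an assumption made here so that ties are broken by row order and each group yields at most 2 rows.

```python
import pandas as pd

df1 = pd.DataFrame({
    "location": ["A", "B", "C", "C", "D", "D", "D", "D"],
    "category": [5, 3, 2, 4, 2, 3, 4, 5],
    "percent": [100.0, 100.0, 50.0, 13.0, 75.0, 59.0, 13.0, 4.0],
})

# Rank percent within each location, largest value first.
# method="first" breaks ties by order of appearance.
ranks = df1.groupby("location")["percent"].rank(method="first", ascending=False)

# Keep the rows ranked 1 or 2 within their group; the other columns are preserved.
top2 = df1[ranks <= 2]
print(top2)
```

Unlike `.nlargest(2)`, which returns only the percent Series, filtering by rank keeps the full rows, including category.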
Dask.dataframe covers a small but well-used portion of the pandas API.
This limitation is for two reasons:
- The pandas API is huge
- Some operations are genuinely hard to do in parallel (e.g. sort).