Is there a way to get the nlargest items per group in dask?


Question

I have the following dataset:

location  category    percent
A         5           100.0
B         3           100.0
C         2            50.0
          4            13.0
D         2            75.0
          3            59.0
          4            13.0
          5             4.0

And I'm trying to get the nlargest items of category in dataframe grouped by location. i.e. If I want the top 2 largest percentages for each group the output should be:

location  category    percent
A         5           100.0
B         3           100.0
C         2            50.0
          4            13.0
D         2            75.0
          3            59.0

It looks like in pandas this is relatively straightforward using pandas.core.groupby.SeriesGroupBy.nlargest, but dask doesn't have an nlargest function for groupby. I've been playing around with apply but can't seem to get it to work properly.

df.groupby(['location']).apply(lambda x: x['percent'].nlargest(2)).compute()

But I just get the error ValueError: Wrong number of items passed 0, placement implies 8
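
For anyone who wants to reproduce this, the example frame can be built with dd.from_pandas (a minimal sketch; the names pdf and df and the choice of npartitions=1 are my assumptions, not from the original post):

import pandas as pd
import dask.dataframe as dd

# The example data from above, as a plain pandas frame
pdf = pd.DataFrame({
    "location": ["A", "B", "C", "C", "D", "D", "D", "D"],
    "category": [5, 3, 2, 4, 2, 3, 4, 5],
    "percent": [100.0, 100.0, 50.0, 13.0, 75.0, 59.0, 13.0, 4.0],
})

# Wrap it as a dask dataframe; one partition keeps the example small,
# but the groupby in the answer below works the same with more partitions.
df = dd.from_pandas(pdf, npartitions=1)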

Answer

The apply should work, but your syntax is a little off:

In [11]: df
Out[11]:
Dask DataFrame Structure:
              Unnamed: 0 location category  percent
npartitions=1
                   int64   object    int64  float64
                     ...      ...      ...      ...
Dask Name: from-delayed, 3 tasks

In [12]: df.groupby("location")["percent"].apply(lambda x: x.nlargest(2), meta=('x', 'f8')).compute()
Out[12]:
location
A         0    100.0
B         1    100.0
C         2     50.0
          3     13.0
D         4     75.0
          5     59.0
Name: x, dtype: float64
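
The meta=('x', 'f8') argument tells dask the name and dtype of the Series the lambda returns, which is why the result above is labelled x. A small variation (my preference, not part of the original answer) keeps the source column's name instead:

# Same computation, but meta names the output after the source column,
# so the result comes back as a Series called "percent" rather than "x".
result = (
    df.groupby("location")["percent"]
      .apply(lambda x: x.nlargest(2), meta=("percent", "f8"))
      .compute()
)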

In pandas you'd have .nlargest and .rank as groupby methods which would let you do this without the apply:

In [21]: df1
Out[21]:
  location  category  percent
0        A         5    100.0
1        B         3    100.0
2        C         2     50.0
3        C         4     13.0
4        D         2     75.0
5        D         3     59.0
6        D         4     13.0
7        D         5      4.0

In [22]: df1.groupby("location")["percent"].nlargest(2)
Out[22]:
location
A         0    100.0
B         1    100.0
C         2     50.0
          3     13.0
D         4     75.0
          5     59.0
Name: percent, dtype: float64
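
For completeness, the .rank route mentioned above can be sketched like this in pandas (method="first" is one tie-breaking choice among several, an assumption on my part):

# Keep the rows whose percent ranks in the top 2 within each location.
# Unlike nlargest, this returns the original columns and index intact.
top2 = df1[df1.groupby("location")["percent"]
              .rank(ascending=False, method="first") <= 2]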

A note from the dask docs:

Dask.dataframe covers a small but well-used portion of the pandas API.
This limitation is for two reasons:

  1. The pandas API is huge
  2. Some operations are genuinely hard to do in parallel (e.g. sort).
