Pandas groupby nlargest sum


Problem description

I am trying to use the groupby, nlargest, and sum functions in Pandas together, but I'm having trouble making it work.

State    County    Population
Alabama  a         100
Alabama  b         50
Alabama  c         40
Alabama  d         5
Alabama  e         1
...
Wyoming  a.51      180
Wyoming  b.51      150
Wyoming  c.51      56
Wyoming  d.51      5

I want to use groupby to group by state, then get the top 2 counties by population in each state, and then sum only those top 2 county populations to get one number per state.

In the end, I'll have a list with each state and the combined population of its top 2 counties.

I can get the groupby and nlargest to work, but getting the sum of the nlargest(2) is a challenge.

The line of code I have right now is simply: df.groupby('State')['Population'].nlargest(2)

Recommended answer

You can use apply after performing the groupby:

df.groupby('State')['Population'].apply(lambda grp: grp.nlargest(2).sum())

I think the issue you're having is that df.groupby('State')['Population'].nlargest(2) returns a plain Series indexed by a MultiIndex (the state plus the original row label), not a groupby object, so you can no longer do group-level operations on it directly. In general, if you want to perform multiple operations within each group, you'll need to use apply/agg.

Resulting output:

State
Alabama    150
Wyoming    330
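
To make the answer easier to try out, here is a minimal sketch that rebuilds only the Alabama and Wyoming rows shown in the question (the states elided by "..." are left out, so this is an assumption about the data, not the full table). It shows what nlargest(2) returns on its own and then the apply-based sum:

```python
import pandas as pd

# Only the rows visible in the question; the elided states are omitted.
df = pd.DataFrame({
    "State": ["Alabama"] * 5 + ["Wyoming"] * 4,
    "County": ["a", "b", "c", "d", "e", "a.51", "b.51", "c.51", "d.51"],
    "Population": [100, 50, 40, 5, 1, 180, 150, 56, 5],
})

# nlargest on the grouped column gives back a Series whose index has two
# levels: the group key (State) and the original row label.
top2 = df.groupby("State")["Population"].nlargest(2)
print(type(top2).__name__)  # Series
print(top2.index.nlevels)   # 2

# Doing both steps inside apply keeps the work per group and yields one
# number per state.
top2_sum = df.groupby("State")["Population"].apply(
    lambda grp: grp.nlargest(2).sum()
)
print(top2_sum)
```

For these rows the sums are 150 for Alabama and 330 for Wyoming, matching the output above.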

Edit

A slightly cleaner approach, as suggested by @cᴏʟᴅsᴘᴇᴇᴅ:

df.groupby('State')['Population'].nlargest(2).sum(level=0)

This is slightly slower than using apply on larger DataFrames, though.
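
One caveat for readers on newer pandas: the level argument of Series.sum was deprecated in pandas 1.3 and removed in 2.0, so the one-liner above raises an error there. A sketch of the equivalent spelling, grouping on the first index level explicitly (the small DataFrame is made up for illustration):

```python
import pandas as pd

# Illustrative data, not the full table from the question.
df = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Alabama", "Wyoming", "Wyoming", "Wyoming"],
    "Population": [100, 50, 40, 180, 150, 56],
})

# Top 2 rows per state; the result is indexed by (State, original row label).
top2 = df.groupby("State")["Population"].nlargest(2)

# Replacement for .sum(level=0): group again on index level 0, then sum.
result = top2.groupby(level=0).sum()
print(result)
```

Passing level="State" instead of level=0 should work as well, since the first index level takes its name from the grouping column.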

Using the following setup:

import numpy as np
import pandas as pd
from string import ascii_letters

n = 10**6
df = pd.DataFrame({'A': np.random.choice(list(ascii_letters), size=n),
                   'B': np.random.randint(10**7, size=n)})

I get the following timings:

In [3]: %timeit df.groupby('A')['B'].apply(lambda grp: grp.nlargest(2).sum())
103 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %timeit df.groupby('A')['B'].nlargest(2).sum(level=0)
147 ms ± 3.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The slower performance is potentially caused by the level kwarg in sum performing a second groupby under the hood.
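
If you want to check the timings outside IPython, here is a rough sketch using the standard-library timeit module. It assumes a smaller n than the benchmark above so it runs quickly, seeds the random data, and writes the second variant as an explicit groupby(level=0).sum() so that it also runs on pandas 2.x. Actual numbers will vary with machine and pandas version:

```python
import timeit
from string import ascii_letters

import numpy as np
import pandas as pd

n = 10**5  # assumption: smaller than the 10**6 used above, to keep this fast
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "A": rng.choice(list(ascii_letters), size=n),
    "B": rng.integers(10**7, size=n),
})

def with_apply():
    # nlargest and sum both happen inside each group
    return df.groupby("A")["B"].apply(lambda grp: grp.nlargest(2).sum())

def with_second_groupby():
    # nlargest per group, then a second groupby on index level 0 to sum
    return df.groupby("A")["B"].nlargest(2).groupby(level=0).sum()

via_apply = with_apply()
via_level = with_second_groupby()

# Average seconds per call over a handful of runs
t_apply = timeit.timeit(with_apply, number=5) / 5
t_level = timeit.timeit(with_second_groupby, number=5) / 5
print(f"apply:          {t_apply * 1000:.1f} ms")
print(f"second groupby: {t_level * 1000:.1f} ms")
```

Both functions should return the same per-group sums; only the time taken differs.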
