Pandas groupby nlargest sum


Question

I am trying to use groupby, nlargest, and sum together in Pandas, but I'm having trouble making them work.

State    County    Population
Alabama  a         100
Alabama  b         50
Alabama  c         40
Alabama  d         5
Alabama  e         1
...
Wyoming  a.51      180
Wyoming  b.51      150
Wyoming  c.51      56
Wyoming  d.51      5

I want to use groupby to select by state, then get the top 2 counties by population. Then use only those top 2 counties' population numbers to get a sum for that state.

In the end, I'll have a list with each state and the population of its top 2 counties.

I can get the groupby and nlargest to work, but getting the sum of the nlargest(2) is a challenge.

The line I have right now is simply: df.groupby('State')['Population'].nlargest(2)

Solution

You can use apply after performing the groupby:

df.groupby('State')['Population'].apply(lambda grp: grp.nlargest(2).sum())

I think the issue you're having is that df.groupby('State')['Population'].nlargest(2) returns a Series with a MultiIndex of (state, original row label), so you can no longer do group-level operations on it directly. In general, if you want to perform multiple operations within a group, you'll need to use apply/agg.
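To make this concrete, here is a minimal, self-contained sketch that rebuilds the abbreviated sample data from the question and runs the one-liner above (the column names follow the question; the numbers are the sample values, not real census figures):

```python
import pandas as pd

# Abbreviated sample data from the question
df = pd.DataFrame({
    'State':      ['Alabama'] * 5 + ['Wyoming'] * 4,
    'County':     ['a', 'b', 'c', 'd', 'e', 'a.51', 'b.51', 'c.51', 'd.51'],
    'Population': [100, 50, 40, 5, 1, 180, 150, 56, 5],
})

# For each state, take the two largest county populations and sum them
top2_sum = df.groupby('State')['Population'].apply(lambda grp: grp.nlargest(2).sum())
print(top2_sum)
# State
# Alabama    150
# Wyoming    330
# Name: Population, dtype: int64
```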

The resulting output:

State
Alabama    150
Wyoming    330

EDIT

A slightly cleaner approach, as suggested by @cᴏʟᴅsᴘᴇᴇᴅ:

df.groupby('State')['Population'].nlargest(2).sum(level=0)

This is slightly slower than using apply on larger DataFrames though.
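One caveat worth noting (an editorial addition, not part of the original answer): the level keyword of sum was deprecated in pandas 1.3 and removed in pandas 2.0, so on current pandas the same idea is spelled with an explicit second groupby on the index level:

```python
import pandas as pd

# Same abbreviated sample data as in the question
df = pd.DataFrame({
    'State':      ['Alabama'] * 5 + ['Wyoming'] * 4,
    'Population': [100, 50, 40, 5, 1, 180, 150, 56, 5],
})

# nlargest(2) leaves a MultiIndex of (State, original row label);
# grouping on level 0 replaces the removed sum(level=0) shortcut
result = df.groupby('State')['Population'].nlargest(2).groupby(level=0).sum()
```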

Using the following setup:

import numpy as np
import pandas as pd
from string import ascii_letters

n = 10**6
df = pd.DataFrame({'A': np.random.choice(list(ascii_letters), size=n),
                   'B': np.random.randint(10**7, size=n)})

I get the following timings:

In [3]: %timeit df.groupby('A')['B'].apply(lambda grp: grp.nlargest(2).sum())
103 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %timeit df.groupby('A')['B'].nlargest(2).sum(level=0)
147 ms ± 3.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The slower performance is potentially caused by the level kwarg in sum performing a second groupby under the hood.
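As a sanity check that the two spellings agree, here is a small sketch along the lines of the benchmark setup above, using a seeded generator and the explicit groupby(level=0) form so it also runs on pandas 2.x:

```python
import numpy as np
import pandas as pd
from string import ascii_letters

n = 10**4  # smaller than the benchmark above; enough to compare results
rng = np.random.default_rng(42)
df = pd.DataFrame({'A': rng.choice(list(ascii_letters), size=n),
                   'B': rng.integers(10**7, size=n)})

via_apply = df.groupby('A')['B'].apply(lambda grp: grp.nlargest(2).sum())
via_nlargest = df.groupby('A')['B'].nlargest(2).groupby(level=0).sum()

# Both routes produce the same per-group sums
pd.testing.assert_series_equal(via_apply, via_nlargest)
```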
