Pandas groupby nlargest sum
I am trying to use groupby, nlargest, and sum in Pandas together, but I'm having trouble making them work.
State County Population
Alabama a 100
Alabama b 50
Alabama c 40
Alabama d 5
Alabama e 1
...
Wyoming a.51 180
Wyoming b.51 150
Wyoming c.51 56
Wyoming d.51 5
I want to use groupby to select by state, then get the top 2 counties by population. Then use only those top 2 county population numbers to get a sum for that state.
In the end, I'll have a list with each state and the population of its top 2 counties.
I can get the groupby and nlargest to work, but getting the sum of the nlargest(2) is a challenge.
The line I have right now is simply: df.groupby('State')['Population'].nlargest(2)
Solution
You can use apply after performing the groupby:
df.groupby('State')['Population'].apply(lambda grp: grp.nlargest(2).sum())
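As a check, here is a minimal runnable sketch using a subset of the sample data from the question (county names and population figures taken from the table above):

```python
import pandas as pd

# Subset of the sample data from the question
df = pd.DataFrame({
    'State': ['Alabama'] * 5 + ['Wyoming'] * 4,
    'County': ['a', 'b', 'c', 'd', 'e', 'a.51', 'b.51', 'c.51', 'd.51'],
    'Population': [100, 50, 40, 5, 1, 180, 150, 56, 5],
})

# For each state, keep the two largest county populations and sum them
result = df.groupby('State')['Population'].apply(lambda grp: grp.nlargest(2).sum())
print(result)
# Alabama -> 150, Wyoming -> 330
```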
I think the issue you're having is that df.groupby('State')['Population'].nlargest(2) returns a Series with a MultiIndex, so you can no longer do group-level operations on it. In general, if you want to perform multiple operations within a group, you'll need to use apply/agg.
The resulting output:
State
Alabama 150
Wyoming 330
EDIT
A slightly cleaner approach, as suggested by @cᴏʟᴅsᴘᴇᴇᴅ:
df.groupby('State')['Population'].nlargest(2).sum(level=0)
This is slightly slower than using apply on larger DataFrames, though.
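A side note not in the original answer: the level= argument to Series.sum was deprecated in pandas 1.x and removed in pandas 2.0, so on recent versions the same idea is spelled with an explicit second groupby on the index level:

```python
import pandas as pd

df = pd.DataFrame({
    'State': ['Alabama', 'Alabama', 'Alabama', 'Wyoming', 'Wyoming', 'Wyoming'],
    'Population': [100, 50, 40, 180, 150, 56],
})

# nlargest(2) leaves a Series with a MultiIndex (State, original row label);
# grouping on level 0 replicates the old sum(level=0) on pandas >= 2.0
result = df.groupby('State')['Population'].nlargest(2).groupby(level=0).sum()
print(result)
# Alabama -> 150, Wyoming -> 330
```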
Using the following setup:
import numpy as np
import pandas as pd
from string import ascii_letters
n = 10**6
df = pd.DataFrame({'A': np.random.choice(list(ascii_letters), size=n),
'B': np.random.randint(10**7, size=n)})
I get the following timings:
In [3]: %timeit df.groupby('A')['B'].apply(lambda grp: grp.nlargest(2).sum())
103 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [4]: %timeit df.groupby('A')['B'].nlargest(2).sum(level=0)
147 ms ± 3.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The slower performance is potentially caused by the level kwarg in sum performing a second groupby under the hood.
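For completeness, another common pattern (not from the original answer) is to sort once and then take the first rows per group, which avoids the Python-level lambda entirely:

```python
import pandas as pd

df = pd.DataFrame({
    'State': ['Alabama', 'Alabama', 'Alabama', 'Wyoming', 'Wyoming', 'Wyoming'],
    'Population': [100, 50, 40, 180, 150, 56],
})

# Sort descending once, keep the top 2 rows of each state, then sum per state
top2 = df.sort_values('Population', ascending=False).groupby('State').head(2)
result = top2.groupby('State')['Population'].sum()
print(result)
# Alabama -> 150, Wyoming -> 330
```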