Pandas: a better way for sorting, grouping, and summing


Question

I'm new to Pandas, so I'm wondering if there is a more Pandithic (coining it!) way to sort some data, group it, and then sum part of it. The problem is to find the 3 largest values in a series of values and then sum only those.

census_cp is a DataFrame with information about the counties of each state. My current solution is:

cen_sort = census_cp.groupby('STNAME').head(3)
cen_sort = cen_sort.groupby('STNAME').sum().sort_values(by='CENSUS2010POP', ascending=False).head(n=3)
cen_sort = cen_sort.reset_index()
print(cen_sort['STNAME'].values.tolist())

I'm specifically curious whether there is a better way to do this, and why I can't put the sum at the end of the previous line and chain together what seem to me to be obviously connected steps (get the top 3 of each group and add them together).

Recommended answer

I think you can group with groupby, apply head(3) followed by sum to each group, and then select the largest sums with nlargest:

df = census_cp.groupby('STNAME') \
              .apply(lambda x: x.head(3).sum(numeric_only=True)) \
              .reset_index() \
              .nlargest(3, 'CENSUS2010POP')

Sample:

import pandas as pd

census_cp = pd.DataFrame({'STNAME':list('abscscbcdbcsscae'),
                          'CENSUS2010POP':[4,5,6,5,6,2,3,4,5,6,4,5,4,3,6,5]})

print (census_cp)
    CENSUS2010POP STNAME
0               4      a
1               5      b
2               6      s
3               5      c
4               6      s
5               2      c
6               3      b
7               4      c
8               5      d
9               6      b
10              4      c
11              5      s
12              4      s
13              3      c
14              6      a
15              5      e


df = census_cp.groupby('STNAME') \
              .apply(lambda x: x.head(3).sum(numeric_only=True)) \
              .reset_index() \
              .nlargest(3, 'CENSUS2010POP')
print (df)
  STNAME  CENSUS2010POP
5      s             17
1      b             14
2      c             11
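
On the question of why sum cannot simply be chained after head: as far as I know, calling head(3) on a GroupBy object returns an ordinary DataFrame containing the first three rows of every group, so the grouping is no longer attached and a following sum would add up all of those rows together. The sketch below reuses the sample data to illustrate this; the variable names first_three, total, per_state and top_states are only illustrative.

```python
import pandas as pd

census_cp = pd.DataFrame({
    'STNAME': list('abscscbcdbcsscae'),
    'CENSUS2010POP': [4, 5, 6, 5, 6, 2, 3, 4, 5, 6, 4, 5, 4, 3, 6, 5],
})

# head(3) on a GroupBy returns a plain DataFrame with the first
# three rows of every group; the grouping itself is not kept.
first_three = census_cp.groupby('STNAME').head(3)
print(type(first_three).__name__)

# Summing that frame directly collapses every remaining row into one number.
total = first_three['CENSUS2010POP'].sum()
print(total)

# Grouping a second time gives one sum per state again.
per_state = first_three.groupby('STNAME')['CENSUS2010POP'].sum()
top_states = per_state.nlargest(3)
print(top_states.index.tolist())
```

This is why the original solution needs a second groupby before sum, and why the answer moves head and sum inside a single apply.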

If you need nlargest twice, first for the 3 largest values within each group and then for the 3 largest of the summed values, use:

df1 = census_cp.groupby('STNAME')['CENSUS2010POP'] \
               .apply(lambda x: x.nlargest(3).sum()) \
               .nlargest(3) \
               .reset_index()
print (df1)
  STNAME  CENSUS2010POP
0      s             17
1      b             14
2      c             13

Or:

df1 = census_cp.groupby('STNAME')['CENSUS2010POP'].nlargest(3) \
               .groupby(level=0) \
               .sum() \
               .nlargest(3) \
               .reset_index()
print (df1)
  STNAME  CENSUS2010POP
0      s             17
1      b             14
2      c             13
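
Note that head(3) takes the first three rows of each group in their existing order, while nlargest(3) takes the three largest values; that is why state c sums to 11 in the first output and to 13 here. As a possible variation that is not part of the original answer, sorting first makes head(3) pick the largest rows, which avoids the lambda. This is a minimal sketch on the same sample data, and the name top3_sums is only illustrative.

```python
import pandas as pd

census_cp = pd.DataFrame({
    'STNAME': list('abscscbcdbcsscae'),
    'CENSUS2010POP': [4, 5, 6, 5, 6, 2, 3, 4, 5, 6, 4, 5, 4, 3, 6, 5],
})

top3_sums = (
    census_cp.sort_values('CENSUS2010POP', ascending=False)  # largest rows first
             .groupby('STNAME')
             .head(3)                           # three largest rows per state
             .groupby('STNAME')['CENSUS2010POP']
             .sum()                             # one sum per state
             .nlargest(3)                       # three largest sums
             .reset_index()
)
print(top3_sums)
```

Wrapping the chain in parentheses is an alternative to the trailing backslashes used above.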
