Apply a function to a groupby


Question

I want to count the number of consecutive increases within each group of a groupby, and the difference between the first and last elements of each group. But I can't get my functions to work on the groupby. Is the object I get after a groupby a list? Also, what is the difference between "apply" and "agg"? Sorry, I have only been using Python for a few days.

def promotion(ls):
    pro =0
    if len(ls)>1:
        for j in range(1,len(ls)):
            if ls[j]>ls[j-1]:
                pro += 1
    return pro
def growth(ls):
    head= ls[0]
    tail= ls[len(ls)-1]
    gro= tail-head
    return gro
titlePromotion= JobData.groupby("candidate_id")["TitleLevel"].apply(promotion)
titleGrowth= JobData.groupby("candidate_id")["TitleLevel"].apply(growth)

The data is:

candidate_id    TitleLevel     othercols
1                 2              foo
2                 1              bar
2                 2              goo
2                 1              gar

The result should be:

titlePromotion
candidate_id 
1                  0
2                  1
titleGrowth
candidate_id
1               0
2               0

Recommended answer

import pandas as pd

def promotion(ls):
    return (ls.diff() > 0).sum()

def growth(ls):
    return ls.iloc[-1] - ls.iloc[0]

jobData = pd.DataFrame(
    {'candidate_id': [1, 2, 2, 2],
     'TitleLevel': [2, 1, 2, 1]})

grouped = jobData.groupby("candidate_id")
titlePromotion = grouped["TitleLevel"].agg(promotion)
print(titlePromotion)
# candidate_id
# 1               0
# 2               1
# dtype: int64

titleGrowth = grouped["TitleLevel"].agg(growth)
print(titleGrowth)
# candidate_id
# 1               0
# 2               0
# dtype: int64


Some tips:

If you define a generic function

def foo(ls):
    print(type(ls))

and call

jobData.groupby("candidate_id")["TitleLevel"].apply(foo)

Python will print

<class 'pandas.core.series.Series'>

This is a low-brow but effective way to discover that calling jobData.groupby(...)[...].apply(foo) passes a Series to foo.

The apply method calls foo once for every group. It can return a Series or a DataFrame with the resulting chunks glued together. It is possible to use apply when foo returns an object such as a numerical value or string, but in such cases I think using agg is preferred. A typical use case for using apply is when you want to, say, square every value in a group and thus need to return a new group of the same shape.
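As a minimal sketch of the "square every value" case, using the same sample data as above (the helper name square is only for illustration):

```python
import pandas as pd

jobData = pd.DataFrame(
    {'candidate_id': [1, 2, 2, 2],
     'TitleLevel': [2, 1, 2, 1]})

grouped = jobData.groupby("candidate_id")["TitleLevel"]

def square(ls):
    # ls is the Series of TitleLevel values for one group;
    # the returned Series has the same shape as the group
    return ls ** 2

squared = grouped.apply(square)

# The values come back glued together, one chunk per group.
# The index layout of the result can differ between pandas versions,
# so only the values are shown here.
print(squared.tolist())  # [4, 1, 4, 1]
```

Because the exact index of the result depends on the pandas version, it is safer to inspect it on your own installation than to rely on a particular layout.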

The transform method is also useful in this situation, when you want to transform every value in the group and thus need to return something of the same shape. The result can differ from that of apply, however, since a different object may be passed to foo: for example, each column of a grouped DataFrame is passed to foo when using transform, while the entire group is passed to foo when using apply. The easiest way to understand this is to experiment with a simple DataFrame and the generic foo.
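Here is a small sketch of transform on the sample data. The function name and the choice of "distance from the first value in the group" are only illustrative assumptions:

```python
import pandas as pd

jobData = pd.DataFrame(
    {'candidate_id': [1, 2, 2, 2],
     'TitleLevel': [2, 1, 2, 1]})

grouped = jobData.groupby("candidate_id")["TitleLevel"]

def relative_to_first(ls):
    # Same shape as the group: each value minus the group's first value
    return ls - ls.iloc[0]

relative = grouped.transform(relative_to_first)

# transform returns one value per original row,
# aligned with the index of jobData
print(relative.tolist())        # [0, 0, 1, 0]
print(relative.index.tolist())  # [0, 1, 2, 3]
```

Since the result lines up with the original rows, it can be assigned straight back as a new column, for example jobData["relative"] = relative.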

The agg method calls foo once for every group, but unlike apply it should return a single number per group. The group is aggregated into a value. A typical use case for using agg is when you want to count the number of items in the group.
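The counting case might look like the sketch below. The helper count_items is a made-up name for illustration; the built-in size method is shown only for comparison:

```python
import pandas as pd

jobData = pd.DataFrame(
    {'candidate_id': [1, 2, 2, 2],
     'TitleLevel': [2, 1, 2, 1]})

grouped = jobData.groupby("candidate_id")["TitleLevel"]

def count_items(ls):
    # One value per group: the number of rows in the group
    return len(ls)

counts = grouped.agg(count_items)
print(counts.index.tolist())  # [1, 2]  (one entry per candidate_id)
print(counts.tolist())        # [1, 3]

# For this particular aggregation pandas already has a method
sizes = grouped.size()
print(sizes.tolist())         # [1, 3]
```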

You can debug and understand what went wrong with your original code by using the generic foo function:

In [30]: grouped['TitleLevel'].apply(foo)
0    2
Name: 1, dtype: int64
--------------------------------------------------------------------------------
1    1
2    2
3    1
Name: 2, dtype: int64
--------------------------------------------------------------------------------
Out[30]: 
candidate_id
1               None
2               None
dtype: object
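The output above looks like it came from a variant of foo that prints the Series itself followed by a separator line, rather than its type. A sketch under that assumption (the seen_indexes list is a hypothetical helper added only to inspect what was passed in):

```python
import pandas as pd

jobData = pd.DataFrame(
    {'candidate_id': [1, 2, 2, 2],
     'TitleLevel': [2, 1, 2, 1]})

grouped = jobData.groupby("candidate_id")

seen_indexes = []  # hypothetical: records the index of each Series passed to foo

def foo(ls):
    print(ls)        # show the Series for this group
    print('-' * 80)  # separator between groups
    seen_indexes.append(list(ls.index))
    # no return statement, so foo returns None for every group

grouped['TitleLevel'].apply(foo)
```

Each group keeps the row labels it had in jobData, which is why the second Series does not start at label 0.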

This shows you the Series that are being passed to foo. Notice that in the second Series, the index values are 1, 2, and 3. So ls[0] raises a KeyError, since there is no label with value 0 in the second Series.

What you really want is the first item in the Series. That is what iloc is for.

So to summarize, use ls[label] to select the row of a Series with index value of label. Use ls.iloc[n] to select the nth row of the Series.
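To make the difference concrete, here is a sketch that rebuilds the second group by hand (the Series is constructed directly for illustration rather than taken from the groupby):

```python
import pandas as pd

# Same values and labels as the second group: positions 0..2, labels 1..3
ls = pd.Series([1, 2, 1], index=[1, 2, 3], name=2)

print(ls.iloc[1])  # 2  (row at position 1, i.e. the second row)
print(ls[1])       # 1  (row whose label is 1, i.e. the first row)

raised = False
try:
    ls[0]          # there is no label 0 in this index
except KeyError:
    raised = True
print(raised)      # True
```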

Thus, to fix your code with the least amount of change, you could use

def promotion(ls):
    pro =0
    if len(ls)>1:
        for j in range(1,len(ls)):
            if ls.iloc[j]>ls.iloc[j-1]:
                pro += 1
    return pro
def growth(ls):
    head= ls.iloc[0]
    tail= ls.iloc[len(ls)-1]
    gro= tail-head
    return gro
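As a quick check, the sketch below runs these loop-based functions on the sample data and compares them with the shorter diff/iloc versions from the start of the answer. The names promotion_short and growth_short are introduced here only to keep both versions side by side:

```python
import pandas as pd

def promotion(ls):
    pro = 0
    for j in range(1, len(ls)):
        if ls.iloc[j] > ls.iloc[j - 1]:
            pro += 1
    return pro

def growth(ls):
    return ls.iloc[len(ls) - 1] - ls.iloc[0]

# Shorter equivalents, renamed for comparison
def promotion_short(ls):
    return (ls.diff() > 0).sum()

def growth_short(ls):
    return ls.iloc[-1] - ls.iloc[0]

jobData = pd.DataFrame(
    {'candidate_id': [1, 2, 2, 2],
     'TitleLevel': [2, 1, 2, 1]})

grouped = jobData.groupby("candidate_id")["TitleLevel"]

titlePromotion = grouped.apply(promotion)
titleGrowth = grouped.apply(growth)

print(titlePromotion.tolist())  # [0, 1]
print(titleGrowth.tolist())     # [0, 0]

# Both versions agree on this data
print(titlePromotion.tolist() == grouped.agg(promotion_short).tolist())  # True
print(titleGrowth.tolist() == grouped.agg(growth_short).tolist())        # True
```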
