Pandas: Use DataFrameGroupBy.filter() method to select DataFrame's rows with a value greater than the mean of the respective group


Question


I am learning Python and Pandas and I am doing some exercises to understand how things work. My question is the following: can I use the GroupBy.filter() method to select the DataFrame's rows that have a value (in a specific column) greater than the mean of the respective group?


For this exercise, I am using the "planets" dataset included in Seaborn: 1035 rows x 6 columns (column names: "method", "number", "orbital_period", "mass", "distance", "year").

In Python:

import pandas as pd
import seaborn as sns

#Load the "planets" dataset included in Seaborn
data = sns.load_dataset("planets")

#Remove rows with NaN in "orbital_period"
data = data.dropna(how = "all", subset = ["orbital_period"])

#Set display of DataFrames for seeing all the columns:
pd.set_option("display.max_columns", 15)

#Group the DataFrame "data" by "method"
group1 = data.groupby("method")
#I obtain a DataFrameGroupBy object (group1) composed of 10 groups.
print(group1)
#Print the composition of the DataFrameGroupBy object "group1".
for lab, datafrm in group1:
    print(lab, "\n", datafrm, sep="", end="\n\n")
print()
print()
print()


#Define the filter function that will be used by the filter() method.
#I want a function that returns True whenever the "orbital_period" value for
#a row is greater than the mean of the corresponding group.
#This could also have been written directly as a lambda passed to filter().
def filter_funct(x):
    #print(type(x))
    #print(x)
    return x["orbital_period"] > x["orbital_period"].mean()


dataFiltered = group1.filter(filter_funct)
print("RESULT OF THE FILTER METHOD:")
print()
print(dataFiltered)
print()
print()


Unfortunately, I obtain the following error when I run the script.

TypeError: filter function returned a Series, but expected a scalar bool


It looks like x["orbital_period"] does not behave as a vector, meaning that it does not return the single values of the Series... Weirdly enough, the transform() method does not suffer from this problem. Indeed, if I run the following on the same dataset (prepared as above):

#Define the transform_function that will be used by the transform() method.
#I want this function to subtract from each value in "orbital_period" the mean
#of the corresponding group.
def transf_funct(x):
    #print(type(x))
    #print(x)
    return x-x.mean()

print("Transform method runs:")
print()
#I directly assign the transformed values to the "orbital_period" column of the DataFrame.
data["orbital_period"] = group1["orbital_period"].transform(transf_funct)
print("RESULT OF THE TRANSFORM METHOD:")
print()
print(data)
print()
print()
print()


I obtain the expected result...


Do DataFrameGroupBy.filter() and DataFrameGroupBy.transform() have different behavior? I know I can achieve what I want in many other ways but my question is: Is there a way to achieve what I want making use of the DataFrameGroupBy.filter() method?

Recommended answer


Can I use DataFrameGroupBy.filter to exclude specific rows within a group?

The answer is No. DataFrameGroupBy.filter uses a single Boolean value to characterize an entire group. The result of the filtering is to remove the entirety of a group if it is characterized as False.
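The group-level behaviour described above can be seen on a small, made-up DataFrame (the column names and values here are illustrative assumptions, not the planets data). This is only a sketch: the first call returns one bool per group and keeps or drops whole groups, while the second returns a Series per group and triggers the same TypeError reported in the question.

```python
import pandas as pd

# Hypothetical toy data: group "a" has mean 2, group "b" has mean 40.
df = pd.DataFrame({
    "grp": ["a", "a", "a", "b", "b", "b"],
    "value": [1, 2, 3, 10, 20, 90],
})

# The callable returns ONE bool per group, so whole groups are kept or dropped.
# Group "b" survives in full, including the rows 10 and 20 that lie below its mean.
kept = df.groupby("grp").filter(lambda g: g["value"].mean() > 5)
print(kept)

# Returning a boolean Series (one value per row) is rejected by filter().
try:
    df.groupby("grp").filter(lambda g: g["value"] > g["value"].mean())
    raised = False
except TypeError as err:
    raised = True
    print("TypeError:", err)
```

Note that filter() never looks inside a group to decide row by row; the decision is made once per group.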


DataFrameGroupBy.filter is very slow, so it's often advised to use transform to broadcast the single truth value to all rows within a group and then to subset the DataFrame [1]. Here is an example of removing entire groups where the mean is <= 50. In this benchmark the filter method is over 100x slower.

import pandas as pd
import numpy as np

N = 10000
df = pd.DataFrame({'grp': np.arange(0,N,1)//10,
                   'value': np.arange(0,N,1)%100})

# With Filter
%timeit df.groupby('grp').filter(lambda x: x['value'].mean() > 50)
#327 ms ± 2.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# With Transform
%timeit df[df.groupby('grp')['value'].transform('mean') > 50]
#2.7 ms ± 39.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# Verify they are equivalent
(df.groupby('grp').filter(lambda x: x['value'].mean() > 50) 
  == df[df.groupby('grp')['value'].transform('mean') > 50]).all().all()
#True


[1] The gain in performance comes from the fact that transform may allow you to use a GroupBy operation which is implemented in Cython, which is the case for mean. If this is not the case, filter may be just as performant, if not slightly better.
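As a rough illustration of that footnote, the sketch below computes the same broadcast group mean in two ways on made-up data: once with the string name 'mean', which lets pandas dispatch to its built-in implementation, and once with a Python lambda that is called group by group. It only checks that the results agree; no timings are claimed here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 4 groups of 5 rows each.
df = pd.DataFrame({"grp": np.arange(20) // 5,
                   "value": np.arange(20) % 7})

# String alias: pandas can use its optimized built-in group mean.
fast = df.groupby("grp")["value"].transform("mean")

# Arbitrary Python callable: invoked once per group.
slow = df.groupby("grp")["value"].transform(lambda s: s.mean())

# Both produce one value per original row, aligned to df's index.
same = np.allclose(fast, slow)
print(same)
```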


Finally, because DataFrameGroupBy.transform broadcasts a result to the entire group, it is the correct tool to use when needing to exclude specific rows within a group based on an overall group characteristic.


In the above example, if you want to keep the rows within each group that are above the group mean, it is:

df[df['value'] > df.groupby('grp')['value'].transform('mean')]
   # Compare          to the mean of the group the row 
   # each row                   belongs to 
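To make the row-wise comparison concrete, here is a minimal sketch on a small, made-up DataFrame (the data are an assumption for illustration). The broadcast mean has one entry per original row, so it can be compared element by element with the column itself.

```python
import pandas as pd

# Hypothetical toy data: group "a" has mean 2, group "b" has mean 40.
df = pd.DataFrame({
    "grp": ["a", "a", "a", "b", "b", "b"],
    "value": [1, 2, 3, 10, 20, 90],
})

# One mean per row, repeated for every member of the row's group.
group_mean = df.groupby("grp")["value"].transform("mean")

# Keep only the rows whose value exceeds the mean of their own group.
above = df[df["value"] > group_mean]
print(above)
```

Applied to the question's DataFrame, the same pattern would presumably read `data[data["orbital_period"] > data.groupby("method")["orbital_period"].transform("mean")]`.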
