Pandas Correlation Groupby

Problem description

Assuming I have a dataframe similar to the below, how would I get the correlation between 2 specific columns and then group by the 'ID' column? I believe the Pandas 'corr' method finds the correlation between all columns. If possible I would also like to know how I could find the 'groupby' correlation using the .agg function (i.e. np.correlate).

What I have:

ID  Val1    Val2    OtherData   OtherData
A   5       4       x           x
A   4       5       x           x
A   6       6       x           x
B   4       1       x           x
B   8       2       x           x
B   7       9       x           x
C   4       8       x           x
C   5       5       x           x
C   2       1       x           x

What I need:

ID  Correlation_Val1_Val2
A   0.12
B   0.22
C   0.05

Thanks!

Solution

You pretty much figured out all the pieces, just need to combine them:

>>> df.groupby('ID')[['Val1','Val2']].corr()

             Val1      Val2
ID                         
A  Val1  1.000000  0.500000
   Val2  0.500000  1.000000
B  Val1  1.000000  0.385727
   Val2  0.385727  1.000000

In your case, printing out a 2x2 matrix for each ID is excessively verbose. I don't see an option to print a scalar correlation instead of the whole matrix, but if you only have two variables you can do something simple like this, which keeps every second row (the Val1 row of each ID) and the last column (Val2):

>>> df.groupby('ID')[['Val1','Val2']].corr().iloc[0::2,-1]

ID       
A   Val1    0.500000
B   Val1    0.385727
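
If a single number per ID is all that is needed, another route is to compute the correlation inside each group with `Series.corr`. The snippet below is a minimal sketch, not part of the original answer: it rebuilds the question's sample data (leaving out the OtherData columns) and names the result `Correlation_Val1_Val2` to match the requested output.

```python
import pandas as pd

# Sample data from the question (the OtherData columns are omitted)
df = pd.DataFrame({
    'ID':   ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'Val1': [5, 4, 6, 4, 8, 7, 4, 5, 2],
    'Val2': [4, 5, 6, 1, 2, 9, 8, 5, 1],
})

# apply() hands each ID's sub-frame to the lambda, which returns one
# Pearson correlation, so the result is a Series indexed by ID
corr_by_id = (
    df.groupby('ID')[['Val1', 'Val2']]
      .apply(lambda g: g['Val1'].corr(g['Val2']))
      .rename('Correlation_Val1_Val2')
)
print(corr_by_id)
```

`apply` is used here because the function needs both columns of a group at once; as far as I know, `.agg` works on one column at a time. Also note that `np.correlate` computes a cross-correlation (a sliding dot product), not a Pearson coefficient, so it would not reproduce the `corr` values shown above.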

For the more general case of 3+ variables

For 3 or more variables, it is not straightforward to create concise output but you could do something like this:

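import pandas as pd

groups = ['Val1', 'Val2', 'Val3', 'Val4']

# Stack the per-ID correlation matrices into a Series indexed by (ID, v1, v2)
stacked = df.groupby('ID')[groups].corr().stack()

# For each variable, keep only its correlations with the variables after it
pieces = [stacked.loc[:, groups[i], groups[i+1]:].reset_index()
          for i in range(len(groups) - 1)]
df2 = pd.concat(pieces, ignore_index=True)

df2.columns = ['ID', 'v1', 'v2', 'corr']
df2.set_index(['ID','v1','v2']).sort_index()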

Note that if we didn't have the groupby element, it would be straightforward to use an upper or lower triangle function from numpy. But since that element is present, it is not so easy to produce concise output in a more elegant manner as far as I can tell.
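
For reference, here is a rough sketch of what the no-groupby case could look like with numpy's upper-triangle helper. The frame `data` is made-up random data for illustration only, and the names `pairs`, `v1` and `v2` are just labels chosen here.

```python
import numpy as np
import pandas as pd

# Hypothetical four-column frame of random numbers, for illustration only
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(20, 4)),
                    columns=['Val1', 'Val2', 'Val3', 'Val4'])

corr = data.corr()

# Row/column positions strictly above the diagonal (k=1 skips the 1.0 entries)
i, j = np.triu_indices(len(corr), k=1)

# One row per unordered pair of variables
pairs = pd.Series(
    corr.to_numpy()[i, j],
    index=pd.MultiIndex.from_arrays([corr.index[i], corr.columns[j]],
                                    names=['v1', 'v2']),
    name='corr',
)
print(pairs)
```

With four variables this yields six rows, one per pair, and each correlation appears only once.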
