Pandas simple correlation of two grouped DataFrame columns


Question

Is there a good way to get the simple correlation of two grouped DataFrame columns?

No matter what I try, the pandas .corr() functions seem to want to return a full correlation matrix. E.g.,

import numpy as np
import pandas as pd

i = pd.MultiIndex.from_product([['A', 'B', 'C'], np.arange(1, 11, 1)], names=['Name', 'Num'])
test = pd.DataFrame(np.random.randn(30, 2), index=i, columns=['X', 'Y'])
test.groupby(['Name'])[['X', 'Y']].corr()

returns

               X         Y
Name                      
A    X  1.000000  0.152663
     Y  0.152663  1.000000
B    X  1.000000 -0.155113
     Y -0.155113  1.000000
C    X  1.000000  0.214197
     Y  0.214197  1.000000

But I am only interested in the off-diagonal term, and it seems kludgy to calculate all four values and then pick out the one I want, as in

# every second row (the 'X' rows), then the 'Y' column
test.groupby(['Name'])[['X', 'Y']].corr().iloc[0::2]['Y']

which gives

A     X    0.152663
B     X   -0.155113
C     X    0.214197
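
The positional slice above depends on the row order of the result. As a rough sketch of a label-based alternative, the 'X' rows can be picked out of the inner index level with DataFrame.xs, which also drops that level so the result is indexed by Name only. The seed value and the variable names corr and xy are assumptions made for this illustration; they are not part of the original question.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # seed chosen only so this illustration is repeatable
i = pd.MultiIndex.from_product([['A', 'B', 'C'], np.arange(1, 11)], names=['Name', 'Num'])
test = pd.DataFrame(np.random.randn(30, 2), index=i, columns=['X', 'Y'])

# Full 2x2 correlation matrix per group, stacked into one frame
corr = test.groupby('Name')[['X', 'Y']].corr()

# Keep the rows labelled 'X' in the inner index level, then take the 'Y' column
xy = corr.xs('X', level=1)['Y']
print(xy)
```

This still computes all four entries per group; it only makes the selection step less fragile.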

Answer

I would expect something like test.groupby('Name')['X'].corr('Y') to work, but it doesn't, and when you pass the Series itself (test['Y']) it becomes slower. At this point apply seems to be the best option:

test.groupby('Name').apply(lambda df: df['X'].corr(df['Y']))
Out: 
Name
A   -0.484955
B    0.520701
C    0.120879
dtype: float64

This iterates over the groups and applies Series.corr within each group's DataFrame. The values differ from those in the question because no random seed was set.
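
To make the numbers repeatable, one option is to fix a seed before generating the data. The following is a minimal sketch, assuming a seed of 42 (an arbitrary choice, not from the original post); it also cross-checks one group against NumPy's np.corrcoef, since Series.corr uses Pearson correlation by default. The names result, a, and expected_a are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(42)  # arbitrary seed (an assumption) so reruns give the same values
i = pd.MultiIndex.from_product([['A', 'B', 'C'], np.arange(1, 11)], names=['Name', 'Num'])
test = pd.DataFrame(np.random.randn(30, 2), index=i, columns=['X', 'Y'])

# One correlation value per group, as in the answer
result = test.groupby('Name').apply(lambda df: df['X'].corr(df['Y']))
print(result)

# Cross-check group 'A' against NumPy's Pearson correlation coefficient
a = test.xs('A', level='Name')
expected_a = np.corrcoef(a['X'], a['Y'])[0, 1]
assert np.isclose(result['A'], expected_a)
```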
