Partial correlation coefficient in a pandas DataFrame (Python)

Problem description

I have data in a pandas DataFrame like this:

df = 

    X1  X2  X3  Y
0   1   2   10  5.077
1   2   2   9   32.330
2   3   3   5   65.140
3   4   4   4   47.270
4   5   2   9   80.570

I want to run a multiple regression analysis on it. Here Y is the dependent variable and X1, X2 and X3 are the independent variables. The correlation matrix, which includes the correlation of each independent variable with the dependent variable, is:

df.corr():

      X1          X2            X3         Y
X1  1.000000    0.353553    -0.409644   0.896626
X2  0.353553    1.000000    -0.951747   0.204882
X3  -0.409644   -0.951747   1.000000    -0.389641
Y   0.896626    0.204882    -0.389641   1.000000

As we can see, Y has the highest correlation with X1, so I selected X1 as the first independent variable. Following the same process, I am now trying to select the second independent variable: the one with the highest partial correlation with Y. So my question is, how do I find the partial correlation in this case?

Your help will be highly appreciated.

Solution

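Pairwise ranks between Y (last column) and the others

If you only need to rank the other columns by their correlation with Y, you can do:

# Correlation matrix as a NumPy array; its last column holds the correlations with Y
corrs = df.corr().values
# Negate the values (not the argsort result) to order columns from highest to lowest correlation
ranks = df.columns[:-1][(-corrs[:-1, -1]).argsort()].tolist()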

Sample run -

In [145]: df
Out[145]: 
         X1        X2        X3         Y
0  0.576562  0.481220  0.148405  0.929005
1  0.732278  0.934351  0.115578  0.379051
2  0.078430  0.575374  0.945908  0.999495
3  0.391323  0.429919  0.265165  0.837510
4  0.525265  0.331486  0.951865  0.998278

In [146]: df.corr()
Out[146]: 
          X1        X2        X3         Y
X1  1.000000  0.354387 -0.642953 -0.646551
X2  0.354387  1.000000 -0.461510 -0.885174
X3 -0.642953 -0.461510  1.000000  0.649758
Y  -0.646551 -0.885174  0.649758  1.000000

In [147]: corrs = df.corr().values

In [148]: df.columns[:-1][(-corrs[:-1, -1]).argsort()].tolist()
Out[148]: ['X3', 'X1', 'X2']
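
The same ranking can also be obtained by label rather than by position, which avoids relying on Y being the last column. This is only a sketch of an alternative, not part of the original answer. It rebuilds the DataFrame from the question and uses the standard pandas calls `DataFrame.corr`, `Series.drop` and `Series.sort_values`; the names `y_corr` and `ranks` are illustrative.

```python
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    "X1": [1, 2, 3, 4, 5],
    "X2": [2, 2, 3, 4, 2],
    "X3": [10, 9, 5, 4, 9],
    "Y": [5.077, 32.330, 65.140, 47.270, 80.570],
})

# Correlation of every predictor with Y, sorted from highest to lowest
y_corr = df.corr()["Y"].drop("Y").sort_values(ascending=False)
ranks = y_corr.index.tolist()
print(ranks)  # ['X1', 'X2', 'X3']
```

To rank by strength of association regardless of sign, sorting `y_corr.abs()` instead would be a reasonable variation.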


Pairwise ranks between all columns

If you want to rank the correlations between every pair of columns, one approach is:

import numpy as np
import pandas as pd

def pairwise_corr_rank(df):
    corrs = df.corr().values
    cols = df.columns
    n = corrs.shape[0]
    # Indices of the strict upper triangle, so that each pair appears exactly once
    r, c = np.triu_indices(n, 1)
    vals = corrs[r, c]
    # Order the pairs from highest to lowest correlation
    idx = vals.argsort()[::-1]
    # Build from a dict so that the Value column keeps a float dtype
    return pd.DataFrame({'P1': cols[r[idx]], 'P2': cols[c[idx]], 'Value': vals[idx]})

Sample run -

In [109]: df
Out[109]: 
   X1  X2  X3       Y
0   1   2  10   5.077
1   2   2   9  32.330
2   3   3   5  65.140
3   4   4   4  47.270
4   5   2   9  80.570

In [110]: df.corr()
Out[110]: 
          X1        X2        X3         Y
X1  1.000000  0.353553 -0.409644  0.896626
X2  0.353553  1.000000 -0.951747  0.204882
X3 -0.409644 -0.951747  1.000000 -0.389641
Y   0.896626  0.204882 -0.389641  1.000000

In [114]: pairwise_corr_rank(df)
Out[114]: 
   P1  P2     Value
0  X1   Y  0.896626
1  X1  X2  0.353553
2  X2   Y  0.204882
3  X3   Y -0.389641
4  X1  X3 -0.409644
5  X2  X3 -0.951747
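
The rankings above use ordinary (zero-order) correlations, whereas the question asks for the partial correlation with Y once X1 has been selected. For a single control variable, the standard first-order formula is r(Y, Xj | X1) = (r(Y, Xj) - r(Y, X1) * r(Xj, X1)) / sqrt((1 - r(Y, X1)^2) * (1 - r(Xj, X1)^2)). The sketch below applies it to the question's data. The helper name `partial_corr_given` and its parameters are hypothetical, not a pandas API, and the function handles only one control variable.

```python
import numpy as np
import pandas as pd

def partial_corr_given(df, target, control):
    """First-order partial correlation of each remaining column with `target`,
    controlling for the single column `control` (hypothetical helper)."""
    r = df.corr()
    out = {}
    for col in df.columns:
        if col in (target, control):
            continue
        # Numerator: correlation with the target minus the part explained by the control
        num = r.loc[target, col] - r.loc[target, control] * r.loc[col, control]
        # Denominator: variation left in target and candidate after removing the control
        den = np.sqrt((1 - r.loc[target, control] ** 2) * (1 - r.loc[col, control] ** 2))
        out[col] = num / den
    return pd.Series(out)

# Data copied from the question
df = pd.DataFrame({
    "X1": [1, 2, 3, 4, 5],
    "X2": [2, 2, 3, 4, 2],
    "X3": [10, 9, 5, 4, 9],
    "Y": [5.077, 32.330, 65.140, 47.270, 80.570],
})

pc = partial_corr_given(df, target="Y", control="X1")
print(pc)
# Candidate with the largest absolute partial correlation
print(pc.abs().idxmax())
```

On this data the partial correlations come out at roughly -0.27 for X2 and -0.06 for X3, so X2 would be the next candidate by absolute value. With more than one variable already selected, a residual-based or matrix-inverse approach would be needed instead of this formula.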
