Partial correlation coefficient in a pandas DataFrame (Python)
Problem description
I have data in a pandas DataFrame like this:
df =
   X1  X2  X3       Y
0   1   2  10   5.077
1   2   2   9  32.330
2   3   3   5  65.140
3   4   4   4  47.270
4   5   2   9  80.570
and I want to do a multiple regression analysis. Here Y is the dependent variable and X1, X2 and X3 are the independent variables. The correlation between each independent variable and the dependent variable is:
df.corr():
          X1        X2        X3         Y
X1  1.000000  0.353553 -0.409644  0.896626
X2  0.353553  1.000000 -0.951747  0.204882
X3 -0.409644 -0.951747  1.000000 -0.389641
Y   0.896626  0.204882 -0.389641  1.000000
As we can see here, Y has the highest correlation with X1, so I have selected X1 as the first independent variable. Following the same process, I am trying to select as the second independent variable the one with the highest partial correlation with Y. So my question is: how do I find the partial correlation in this case?
Your help will be highly appreciated.
Pairwise ranks between Y (last col) and others
If you are only trying to rank the other columns by their correlation with Y, simply do -
corrs = df.corr().values
# Negate the correlations (not the argsort result) to rank from highest to lowest
ranks = df.columns[:-1][(-corrs[:-1, -1]).argsort()].tolist()
Sample run -
In [145]: df
Out[145]:
         X1        X2        X3         Y
0  0.576562  0.481220  0.148405  0.929005
1  0.732278  0.934351  0.115578  0.379051
2  0.078430  0.575374  0.945908  0.999495
3  0.391323  0.429919  0.265165  0.837510
4  0.525265  0.331486  0.951865  0.998278
In [146]: df.corr()
Out[146]:
          X1        X2        X3         Y
X1  1.000000  0.354387 -0.642953 -0.646551
X2  0.354387  1.000000 -0.461510 -0.885174
X3 -0.642953 -0.461510  1.000000  0.649758
Y  -0.646551 -0.885174  0.649758  1.000000
In [147]: corrs = df.corr().values
In [148]: df.columns[:-1][(-corrs[:-1, -1]).argsort()].tolist()
Out[148]: ['X3', 'X1', 'X2']
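The same highest-to-lowest ranking can also be written with label-based pandas calls, which avoids positional indexing. This is only a minimal sketch, assuming the DataFrame from the question with Y as the target column; the variable names `y_corr` and `ranks` are illustrative.

```python
import pandas as pd

# Example data from the question
df = pd.DataFrame({
    'X1': [1, 2, 3, 4, 5],
    'X2': [2, 2, 3, 4, 2],
    'X3': [10, 9, 5, 4, 9],
    'Y': [5.077, 32.330, 65.140, 47.270, 80.570],
})

# Correlation of every other column with Y, sorted from highest to lowest
y_corr = df.corr()['Y'].drop('Y').sort_values(ascending=False)
ranks = y_corr.index.tolist()
print(ranks)  # ['X1', 'X2', 'X3']
```

If the strength of the relationship matters more than its sign, sorting `y_corr.abs()` instead would rank by magnitude.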
Pairwise ranks between all columns
If you want to rank the correlations between all pairs of columns, one approach would be -
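import numpy as np
import pandas as pd

def pairwise_corr_rank(df):
    corrs = df.corr().values
    cols = df.columns
    n = corrs.shape[0]
    # Strict upper triangle: every pair of columns exactly once
    r, c = np.triu_indices(n, 1)
    vals = corrs[r, c]
    # Order the pairs from highest to lowest correlation
    idx = vals.argsort()[::-1]
    # A flat column list keeps plain (non-MultiIndex) headers and a float Value column
    return pd.DataFrame({'P1': cols[r[idx]], 'P2': cols[c[idx]], 'Value': vals[idx]})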
Sample run -
In [109]: df
Out[109]:
   X1  X2  X3       Y
0   1   2  10   5.077
1   2   2   9  32.330
2   3   3   5  65.140
3   4   4   4  47.270
4   5   2   9  80.570
In [110]: df.corr()
Out[110]:
          X1        X2        X3         Y
X1  1.000000  0.353553 -0.409644  0.896626
X2  0.353553  1.000000 -0.951747  0.204882
X3 -0.409644 -0.951747  1.000000 -0.389641
Y   0.896626  0.204882 -0.389641  1.000000
In [114]: pairwise_corr_rank(df)
Out[114]:
   P1  P2     Value
0  X1   Y  0.896626
1  X1  X2  0.353553
2  X2   Y  0.204882
3  X3   Y -0.389641
4  X1  X3 -0.409644
5  X2  X3 -0.951747
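The rankings above use ordinary pairwise correlations. For the partial correlation the question asks about, one common approach is to regress both variables on the predictors already selected and then correlate the residuals. The following is a minimal sketch under that assumption: `partial_corr` is a hypothetical helper written for this example (it is not a pandas function), and choosing the candidate with the largest absolute partial correlation is an assumed selection rule.

```python
import numpy as np
import pandas as pd

def partial_corr(df, x, y, controls):
    """Correlation of df[x] and df[y] after removing the linear effect
    of the `controls` columns from both (hypothetical helper)."""
    # Design matrix: an intercept column plus the control variables
    Z = np.column_stack([np.ones(len(df)), df[controls].to_numpy(dtype=float)])
    residuals = []
    for col in (x, y):
        v = df[col].to_numpy(dtype=float)
        # Least-squares fit of v on Z, then keep the part Z cannot explain
        beta, *_ = np.linalg.lstsq(Z, v, rcond=None)
        residuals.append(v - Z @ beta)
    return np.corrcoef(residuals[0], residuals[1])[0, 1]

# Example data from the question
df = pd.DataFrame({
    'X1': [1, 2, 3, 4, 5],
    'X2': [2, 2, 3, 4, 2],
    'X3': [10, 9, 5, 4, 9],
    'Y': [5.077, 32.330, 65.140, 47.270, 80.570],
})

selected = ['X1']            # already chosen as the first predictor
candidates = ['X2', 'X3']    # remaining predictors to compare
partials = {c: partial_corr(df, c, 'Y', selected) for c in candidates}

# Assumption: pick the candidate with the largest absolute partial correlation
best = max(partials, key=lambda c: abs(partials[c]))
print(partials, best)
```

With a single control variable this agrees with the first-order formula r_yx.z = (r_yx - r_yz * r_xz) / sqrt((1 - r_yz^2) * (1 - r_xz^2)). With only five rows, the estimates are very noisy, so the resulting choice should be read with caution.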