如何计算Pandas中列的成对相关的p值? [英] How to calculate p-values for pairwise correlation of columns in Pandas?
问题描述
Pandas具有非常方便的功能,可以使用 pd.corr()。
这意味着可以比较任何长度的列之间的相关性。例如:
Pandas has the very handy function to do pairwise correlation of columns using pd.corr(). That means it is possible to compare correlations between columns of any length. For instance:
df = pd.DataFrame(np.random.randint(0,100,size=(100, 10)))
0 1 2 3 4 5 6 7 8 9
0 9 17 55 32 7 97 61 47 48 46
1 8 83 87 56 17 96 81 8 87 0
2 60 29 8 68 56 63 81 5 24 52
3 42 76 6 75 7 59 19 17 3 63
...
现在可以使用 df.corr(method ='pearson')
来测试所有10列之间的相关性:
Now it is possible to test correlation between all 10 columns with df.corr(method='pearson')
:
0 1 2 3 4 5 6 7 8 9
0 1.000000 0.082789 -0.094096 -0.086091 0.163091 0.013210 0.167204 -0.002514 0.097481 0.091020
1 0.082789 1.000000 0.027158 -0.080073 0.056364 -0.050978 -0.018428 -0.014099 -0.135125 -0.043797
2 -0.094096 0.027158 1.000000 -0.102975 0.101597 -0.036270 0.202929 0.085181 0.093723 -0.055824
3 -0.086091 -0.080073 -0.102975 1.000000 -0.149465 0.033130 -0.020929 0.183301 -0.003853 -0.062889
4 0.163091 0.056364 0.101597 -0.149465 1.000000 -0.007567 -0.017212 -0.086300 0.177247 -0.008612
5 0.013210 -0.050978 -0.036270 0.033130 -0.007567 1.000000 -0.080148 -0.080915 -0.004612 0.243713
6 0.167204 -0.018428 0.202929 -0.020929 -0.017212 -0.080148 1.000000 0.135348 0.070330 0.008170
7 -0.002514 -0.014099 0.085181 0.183301 -0.086300 -0.080915 0.135348 1.000000 -0.114413 -0.111642
8 0.097481 -0.135125 0.093723 -0.003853 0.177247 -0.004612 0.070330 -0.114413 1.000000 -0.153564
9 0.091020 -0.043797 -0.055824 -0.062889 -0.008612 0.243713 0.008170 -0.111642 -0.153564 1.000000
是否有一种简单的方法也可以获取相应的p值(理想情况下是熊猫),因为它返回了例如由scipy的 kendalltau()?
Is there a simple way to also get the corresponding p-values (ideally in pandas), as it is returned e.g. by scipy's kendalltau()?
推荐答案
可能只是循环。基本上,熊猫在源代码中所做的就是生成相关矩阵:
Probably just loop. It's basically what pandas does in the source code to generate the correlation matrix anyway:
import pandas as pd
import numpy as np
from scipy import stats
df_corr = pd.DataFrame() # Correlation matrix
df_p = pd.DataFrame() # Matrix of p-values
for x in df.columns:
for y in df.columns:
corr = stats.pearsonr(df[x], df[y])
df_corr.loc[x,y] = corr[0]
df_p.loc[x,y] = corr[1]
如果您想利用对称的事实,那么只需要对其中一半进行计算,就可以做到:
If you want to leverage the fact that this is symmetric, so you only need to calculate this for roughly half of them, then do:
mat = df.values.T
K = len(df.columns)
correl = np.empty((K,K), dtype=float)
p_vals = np.empty((K,K), dtype=float)
for i, ac in enumerate(mat):
for j, bc in enumerate(mat):
if i > j:
continue
else:
corr = stats.pearsonr(ac, bc)
#corr = stats.kendalltau(ac, bc)
correl[i,j] = corr[0]
correl[j,i] = corr[0]
p_vals[i,j] = corr[1]
p_vals[j,i] = corr[1]
df_p = pd.DataFrame(p_vals)
df_corr = pd.DataFrame(correl)
#pd.concat([df_corr, df_p], keys=['corr', 'p_val'])
这篇关于如何计算Pandas中列的成对相关的p值?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!