How to efficiently get the correlation matrix (with p-values) of a data frame with NaN values?
Problem description
I am trying to compute a correlation matrix and filter the correlations by their p-values to find the highly correlated pairs.
To explain what I mean, say I have a data frame like this:
df
     A    B    C    D
0    2  NaN    2   -2
1  NaN    1    1  1.1
2    1  NaN  NaN  3.2
3   -4  NaN    2    2
4  NaN    1  2.1  NaN
5  NaN    3    1    1
6    3  NaN    0  NaN
For the correlation coefficients, I used df.corr() (the pandas DataFrame.corr method). This method can process a data frame with NaN values and, more importantly, it tolerates pairs of columns that have zero overlapping observations (columns A and B):
rho = df.corr()
          A         B         C         D
A  1.000000       NaN -0.609994  0.041204
B       NaN  1.000000 -0.500000 -1.000000
C -0.609994 -0.500000  1.000000  0.988871
D  0.041204 -1.000000  0.988871  1.000000
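As a minimal sketch of the behaviour described above, the snippet below rebuilds the example frame (values typed in from the table, so printed results may differ from the matrix quoted in the question), computes the pairwise-complete Pearson correlations, and counts how many rows each pair of columns shares. The name `n_overlap` is introduced here for illustration only; `min_periods` is an existing parameter of `DataFrame.corr`.

```python
import numpy as np
import pandas as pd

# Example frame, typed in from the table in the question.
df = pd.DataFrame({
    "A": [2, np.nan, 1, -4, np.nan, np.nan, 3],
    "B": [np.nan, 1, np.nan, np.nan, 1, 3, np.nan],
    "C": [2, 1, np.nan, 2, 2.1, 1, 0],
    "D": [-2, 1.1, 3.2, 2, np.nan, 1, np.nan],
})

# Pearson correlation using only the rows where both columns are present.
rho = df.corr()

# Number of rows in which both columns of a pair are non-NaN.
mask = df.notna().astype(int)
n_overlap = mask.T.dot(mask)

# Optionally blank out coefficients backed by fewer than 3 shared rows.
rho_min3 = df.corr(min_periods=3)

print(rho)
print(n_overlap)
```

Pairs with no shared rows (A and B here) come back as NaN instead of raising, and `n_overlap` shows which coefficients rest on only two or three observations.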
The challenge is computing the p-values. I didn't find a built-in method for this. However, in the question "pandas columns correlation with statistical significance", @BKay provided a loop-based way to compute the p-value. That method raises an error when a pair of columns has fewer than 3 overlapping observations (the error is shown below), so I modified it to catch the exception.
ValueError: zero-size array to reduction operation maximum which has no identity
pval = rho.copy()
# Note: pd.ols exists only in older pandas releases; it was removed later.
for i in range(df.shape[1]):  # loop over every pair of columns (i, j)
    for j in range(df.shape[1]):
        try:
            df_ols = pd.ols(y=df.iloc[:, i], x=df.iloc[:, j], intercept=True)
            pval.iloc[i, j] = df_ols.f_stat['p-value']
        except ValueError:
            # too few overlapping observations to fit the regression
            pval.iloc[i, j] = None
pval
          A         B         C         D
A  0.000000       NaN  0.582343  0.973761
B       NaN  0.000000  0.666667       NaN
C  0.582343  0.666667  0.000000  0.011129
D  0.973761       NaN  0.011129  0.000000
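Since `pd.ols` is not available in current pandas, the same loop can be sketched with `scipy.stats.pearsonr`. This is a substitute technique, not the code from the question: for a regression with a single predictor, the F-test p-value equals the two-sided p-value of the Pearson correlation, which is what `pearsonr` returns. The function name `pairwise_pvalues` and the `min_overlap` parameter are hypothetical names chosen for this example.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Example frame, typed in from the table in the question.
df = pd.DataFrame({
    "A": [2, np.nan, 1, -4, np.nan, np.nan, 3],
    "B": [np.nan, 1, np.nan, np.nan, 1, 3, np.nan],
    "C": [2, 1, np.nan, 2, 2.1, 1, 0],
    "D": [-2, 1.1, 3.2, 2, np.nan, 1, np.nan],
})

def pairwise_pvalues(frame, min_overlap=3):
    """Loop over column pairs; NaN where fewer than min_overlap rows are shared."""
    cols = list(frame.columns)
    pval = pd.DataFrame(np.nan, index=cols, columns=cols)
    for i, a in enumerate(cols):
        for b in cols[i:]:
            x, y = frame[a], frame[b]
            ok = x.notna() & y.notna()       # rows present in both columns
            if ok.sum() < min_overlap:
                continue                     # leave the entry as NaN
            p = stats.pearsonr(x[ok].to_numpy(), y[ok].to_numpy())[1]
            pval.loc[a, b] = pval.loc[b, a] = p  # the matrix is symmetric
    return pval

pval = pairwise_pvalues(df)
print(pval)
```

Only the upper triangle is computed and then mirrored, which roughly halves the work, but the loop is still quadratic in the number of columns and will remain slow for hundreds of columns.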
This method outputs a p-value matrix, but it becomes extremely slow as the original data frame grows (my real data frame is ~5000 rows x 500 columns). How can I compute this p-value matrix efficiently for a large data frame?
Recommended answer
This question turned out to be a good solution.
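The solution the answer points to is not reproduced on this page. As one possible approach (an assumption here, not necessarily what that solution does), the p-values can be computed for all pairs at once: with n shared observations, t = r * sqrt((n - 2) / (1 - r^2)) follows a Student t distribution with n - 2 degrees of freedom under the null hypothesis of zero correlation. The names `corr_pvalues` and `min_overlap` are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Example frame, typed in from the table in the question.
df = pd.DataFrame({
    "A": [2, np.nan, 1, -4, np.nan, np.nan, 3],
    "B": [np.nan, 1, np.nan, np.nan, 1, 3, np.nan],
    "C": [2, 1, np.nan, 2, 2.1, 1, 0],
    "D": [-2, 1.1, 3.2, 2, np.nan, 1, np.nan],
})

def corr_pvalues(frame, min_overlap=3):
    """Return (rho, pval) computed without a Python loop over column pairs."""
    rho = frame.corr()                          # pairwise-complete Pearson r
    mask = frame.notna().to_numpy().astype(float)
    n = mask.T @ mask                           # shared-row count per pair
    r = np.clip(rho.to_numpy(), -1.0, 1.0)      # guard against rounding past 1
    dof = n - 2.0
    # Division by zero (|r| == 1) and invalid values (dof <= 0) are expected
    # for some pairs; the affected entries become 0 or NaN respectively.
    with np.errstate(divide="ignore", invalid="ignore"):
        t = r * np.sqrt(dof / (1.0 - r ** 2))
        p = 2.0 * stats.t.sf(np.abs(t), dof)    # two-sided p-value
    p[n < min_overlap] = np.nan                 # too little data to test
    return rho, pd.DataFrame(p, index=rho.index, columns=rho.columns)

rho, pval = corr_pvalues(df)
print(pval)
```

The only heavy steps are `DataFrame.corr` and one matrix product for the overlap counts, so a frame of about 5000 rows x 500 columns avoids the 250,000 separate regression fits of the loop version.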