How to efficiently get the correlation matrix (with p-values) of a data frame with NaN values?

Views: 110 | Published: 2020/5/24 2:00:53 | Tags: python, pandas, correlation, p-value

Problem description

I am trying to compute a correlation matrix and filter the correlations by their p-values to find the highly correlated pairs.

To explain what I mean, say I have a data frame like this:

df

    A       B       C       D
0   2       NaN     2       -2
1   NaN     1       1       1.1
2   1       NaN     NaN     3.2
3   -4      NaN     2       2
4   NaN     1       2.1     NaN
5   NaN     3       1       1
6   3       NaN     0       NaN
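For anyone who wants to reproduce the example, the frame above can be built as follows. This is a minimal sketch with the values transcribed from the table; nothing else is assumed.

```python
import numpy as np
import pandas as pd

# Values transcribed column by column from the table above.
df = pd.DataFrame({
    "A": [2, np.nan, 1, -4, np.nan, np.nan, 3],
    "B": [np.nan, 1, np.nan, np.nan, 1, 3, np.nan],
    "C": [2, 1, np.nan, 2, 2.1, 1, 0],
    "D": [-2, 1.1, 3.2, 2, np.nan, 1, np.nan],
})
print(df)
```

Note that columns A and B never have a value in the same row. Also, running `df.corr()` on this transcription reproduces the A-C, B-C and B-D coefficients quoted in the question, but the A-D and C-D entries come out differently, so the original D column presumably differed from what is displayed.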

For the correlation coefficients I used df.corr(). This method can process a data frame with NaN values and, more importantly, it tolerates pairs of columns that have zero overlapping rows (columns A and B):

rho = df.corr()

       A          B            C           D
A   1.000000     NaN       -0.609994    0.041204
B   NaN          1.0       -0.500000    -1.000000
C   -0.609994    -0.5       1.000000    0.988871
D   0.041204     -1.0       0.988871    1.000000
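To see why some entries are NaN or exactly -1.0, it helps to count how many rows each pair of columns shares. The sketch below is one way to do that; the names `valid`, `n_overlap` and `rho_min3` are illustrative, and the frame is rebuilt from the table so the block runs on its own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [2, np.nan, 1, -4, np.nan, np.nan, 3],
    "B": [np.nan, 1, np.nan, np.nan, 1, 3, np.nan],
    "C": [2, 1, np.nan, 2, 2.1, 1, 0],
    "D": [-2, 1.1, 3.2, 2, np.nan, 1, np.nan],
})

# 1 where a value is present, 0 where it is NaN.
valid = df.notna().astype(int)

# Entry (i, j) is the number of rows in which both column i and column j are present.
n_overlap = valid.T @ valid

# Pairwise-complete Pearson correlation (the default behaviour of DataFrame.corr).
rho = df.corr()

# Same, but pairs with fewer than 3 shared rows are reported as NaN.
rho_min3 = df.corr(min_periods=3)

print(n_overlap)
print(rho_min3)
```

A and B share no rows, so their correlation is undefined. B and D share only two rows, and any two points lie on a line, which is why that coefficient is exactly -1.0; `min_periods=3` masks such pairs.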

The challenge is computing the p-values. I did not find a built-in method for this. However, in "pandas columns correlation with statistical significance", @BKay provided a loop-based way to compute the p-value. That method raises the following error when a pair of columns has fewer than 3 overlapping rows, so I modified it by adding exception handling.

ValueError: zero-size array to reduction operation maximum which has no identity

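pval = rho.copy()
n_cols = df.shape[1]  # number of columns in df; pval is n_cols x n_cols
for i in range(n_cols):
    for j in range(n_cols):
        try:
            # Note: pd.ols was removed in pandas 0.20, so this call needs an older pandas.
            df_ols = pd.ols(y=df.iloc[:, i], x=df.iloc[:, j], intercept=True)
            pval.iloc[i, j] = df_ols.f_stat['p-value']
        except ValueError:
            # Too few overlapping rows to fit the regression.
            pval.iloc[i, j] = None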

pval
        A        B            C           D
A   0.000000    NaN         0.582343    0.973761
B   NaN         0.000000    0.666667    NaN
C   0.582343    0.666667    0.000000    0.011129
D   0.973761    NaN         0.011129    0.000000
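Since `pd.ols` is no longer available in current pandas, the same loop can be approximated with `scipy.stats.pearsonr`. This is a sketch, not the original poster's code: for a regression with one predictor and an intercept, the F-test p-value coincides with the two-sided p-value of the Pearson correlation, which is what `pearsonr` returns. The helper name `pairwise_pvalues` and the `min_overlap` parameter are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    "A": [2, np.nan, 1, -4, np.nan, np.nan, 3],
    "B": [np.nan, 1, np.nan, np.nan, 1, 3, np.nan],
    "C": [2, 1, np.nan, 2, 2.1, 1, 0],
    "D": [-2, 1.1, 3.2, 2, np.nan, 1, np.nan],
})

def pairwise_pvalues(frame, min_overlap=3):
    """Loop over column pairs; NaN where fewer than min_overlap rows are shared."""
    cols = list(frame.columns)
    pval = pd.DataFrame(np.nan, index=cols, columns=cols)
    for i, a in enumerate(cols):
        for b in cols[i:]:
            # Keep only the rows where both columns have a value.
            mask = frame[a].notna() & frame[b].notna()
            if mask.sum() < min_overlap:
                continue  # leave NaN, mirroring the try/except above
            _, p = stats.pearsonr(frame.loc[mask, a], frame.loc[mask, b])
            pval.loc[a, b] = p
            pval.loc[b, a] = p  # the matrix is symmetric
    return pval

pval = pairwise_pvalues(df)
print(pval)
```

This visits each pair once instead of twice, but it is still a Python-level double loop, so it will remain slow for hundreds of columns.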

This method outputs a p-value matrix, but it becomes extremely slow as the data frame grows (my real data frame is ~5000 rows x 500 columns). How can I compute this p-value matrix efficiently for a large data frame?
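One possible direction, offered as a sketch rather than a definitive answer, is to avoid the loop entirely: take the coefficients from `df.corr()`, count the shared rows per pair with a matrix product, and convert each coefficient to a two-sided p-value through the t statistic t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom. This assumes the usual Pearson significance test is what is wanted; the function name `corr_pvalues` and the `min_overlap` parameter are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats

def corr_pvalues(frame, min_overlap=3):
    """Vectorised two-sided p-values for pairwise-complete Pearson correlations."""
    valid = frame.notna().to_numpy(dtype=float)
    n = valid.T @ valid  # shared (non-NaN) row counts per column pair

    # Pairs with too few shared rows come back as NaN and stay NaN below.
    rho = frame.corr(min_periods=min_overlap).to_numpy()
    rho = np.clip(rho, -1.0, 1.0)  # guard against rounding slightly outside [-1, 1]

    # Degrees of freedom; undefined when fewer than 3 rows are shared.
    dof = np.where(n >= 3, n - 2.0, np.nan)

    with np.errstate(divide="ignore", invalid="ignore"):
        t = rho * np.sqrt(dof / (1.0 - rho ** 2))  # inf when |rho| == 1
        p = 2.0 * stats.t.sf(np.abs(t), dof)       # two-sided tail probability

    return pd.DataFrame(p, index=frame.columns, columns=frame.columns)

df = pd.DataFrame({
    "A": [2, np.nan, 1, -4, np.nan, np.nan, 3],
    "B": [np.nan, 1, np.nan, np.nan, 1, 3, np.nan],
    "C": [2, 1, np.nan, 2, 2.1, 1, 0],
    "D": [-2, 1.1, 3.2, 2, np.nan, 1, np.nan],
})
pvals = corr_pvalues(df)
print(pvals)
```

All the heavy work happens in `DataFrame.corr`, one matrix product and a few elementwise array operations, so the cost no longer grows with a Python-level loop over column pairs. Diagonal entries come out as 0, and pairs with fewer than `min_overlap` shared rows are NaN.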