python - how to compute correlation-matrix with nans in data-matrix


Question

I couldn't find a function that computes a matrix of correlation coefficients for arrays containing observations of more than two variables when there are NaNs in the data. There are functions that do this for pairs of variables (or you can just mask the arrays using ~np.isnan()). But looping over a large number of variables and computing the correlation for each pair with these functions can be very time-consuming.

So I tried on my own and soon realized that the difficulty lies in the proper normalization of the covariance. I would be very interested in your opinions on how to do it.

Here is the code:

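import numpy as np

def nancorr(X, nanfact=False):
    # Each row of X is a variable, each column an observation.
    # Demean every row by its NaN-ignoring mean (this creates a copy of X).
    X = X - np.nanmean(X, axis=1, keepdims=True)

    if nanfact:
        # fact[i, j] = (# observations where i and j are not both NaN) - 1
        mask = np.isnan(X).astype(int)
        fact = X.shape[1] - np.dot(mask, mask.T) - 1

    # NaN entries contribute zero to every sum of products.
    X[np.isnan(X)] = 0
    if nanfact:
        cov = np.dot(X, X.T) / fact
    else:
        cov = np.dot(X, X.T)

    # Normalize by the square roots of the diagonal entries (the variances).
    d = np.diag(cov)
    return cov / np.sqrt(np.multiply.outer(d, d))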

The function assumes that each row is a variable. It is basically adapted from the code of numpy's corrcoef(). I believe there are three ways of doing this:

(1) For each pair of variables, you take only those observations for which neither variable is NaN. This is arguably the most accurate option, but also the most difficult one to program if you want to do the computation for more than one pair simultaneously, and it is not covered by the code above. Why, however, throw away information on the mean and variance of each variable just because the corresponding entry of another variable is NaN? Hence, two other options.

(2) We demean each variable by its nanmean, and the variance of each variable is its nanvariance. For the covariance, each observation where one or the other variable is NaN, but not both, counts as an observation of no covariation and is therefore set to zero. The factor of the covariance is then 1/n, where n is the number of observations for which not both variables are NaN, minus 1. The two variances in the denominator of the correlation coefficient are scaled by 1/n1 and 1/n2, where n1 and n2 are the respective numbers of non-NaN observations minus 1. This is achieved by setting nanfact=True in the function above.

(3) One may wish the covariance and the variances to have the same factor, as is the case for the correlation coefficient without NaNs. The only meaningful way to do this here (if option (1) is not feasible) is to simply ignore the factor (1/n)/sqrt(1/(n1*n2)). Since this number is no larger than one, the estimated correlation coefficients will be at least as large (in absolute value) as in (2), but will remain between -1 and 1. This is achieved by setting nanfact=False.
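To make the difference between (2) and (3) concrete, here is a minimal sketch that computes the factor sqrt(n1*n2)/n for every pair of variables. The helper name `nan_factor_ratio` and the small example array are hypothetical and only for illustration; the counts follow the definitions given in (2), with rows as variables.

```python
import numpy as np

def nan_factor_ratio(X):
    """Return sqrt(n1*n2)/n for every pair of rows of X (hypothetical helper).

    n[i, j]  = (# observations where rows i and j are not both NaN) - 1
    n[i, i]  = (# non-NaN observations of row i) - 1
    """
    isnan = np.isnan(X).astype(int)
    both_nan = isnan @ isnan.T            # pairwise count of jointly missing columns
    n = X.shape[1] - both_nan - 1         # covariance denominators from option (2)
    n_self = np.diag(n)                   # variance denominators n1, n2
    return np.sqrt(np.multiply.outer(n_self, n_self)) / n

# Illustrative data: 3 variables (rows), 5 observations (columns).
X = np.array([[1.0, 2.0, np.nan, 4.0, 5.0],
              [2.0, np.nan, np.nan, 1.0, 0.0],
              [0.5, 1.5, 2.5, np.nan, 4.5]])
ratio = nan_factor_ratio(X)
print(ratio)
```

Under these definitions the result of option (2) equals the result of option (3) multiplied elementwise by this ratio, so the ratio shows how strongly (2) shrinks each coefficient toward zero.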

I'd be very interested in your opinions on approaches (2) and (3), and especially I would very much like to see a solution to (1) without the use of loops.

Recommended answer

I think the method you are looking for is corr() from pandas. It computes pairwise correlations of the columns and excludes NaN values for each pair. An example with a small dataframe follows. You can also refer to this question: "How to efficiently get the correlation matrix (with p-values) of a data frame with NaN values?"

import pandas as pd
df = pd.DataFrame({'A': [2, None, 1, -4, None, None, 3],
                   'B': [None, 1, None, None, 1, 3, None],
                   'C': [2, 1, None, 2, 2.1, 1, 0],
                   'D': [-2, 1.1, 3.2, 2, None, 1, None]})

df

    A       B       C       D
0   2       NaN     2       -2
1   NaN     1       1       1.1
2   1       NaN     NaN     3.2
3   -4      NaN     2       2
4   NaN     1       2.1     NaN
5   NaN     3       1       1
6   3       NaN     0       NaN

rho = df.corr()
rho

           A     B          C          D
A   1.000000   NaN  -0.609994  -0.441784
B        NaN   1.0  -0.500000  -1.000000
C  -0.609994  -0.5   1.000000  -0.347928
D  -0.441784  -1.0  -0.347928   1.000000
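For a NumPy-only version of option (1) without loops, the pairwise-complete sums can be obtained from matrix products with a validity mask. The following is a sketch under stated assumptions, not pandas' actual implementation: the function name `pairwise_nancorr` is hypothetical, rows are variables as in the question, and pairs with no common observations come out as NaN.

```python
import numpy as np

def pairwise_nancorr(X):
    """Pearson correlation between rows of X using, for each pair,
    only the columns where both rows are non-NaN (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    # Shifting a row by a constant does not change its correlations,
    # but keeps the sums below small for better numerical stability.
    X = X - np.nanmean(X, axis=1, keepdims=True)
    valid = (~np.isnan(X)).astype(float)     # 1 where observed, 0 where NaN
    X0 = np.where(np.isnan(X), 0.0, X)       # NaNs replaced by zero

    n = valid @ valid.T                      # common observations per pair
    sum_x = X0 @ valid.T                     # sum of row i over columns valid in i and j
    sum_y = sum_x.T                          # sum of row j over the same columns
    sum_xx = (X0 ** 2) @ valid.T             # sum of squares of row i over those columns
    sum_yy = sum_xx.T
    sum_xy = X0 @ X0.T                       # sum of products over those columns

    with np.errstate(divide="ignore", invalid="ignore"):
        cov = sum_xy - sum_x * sum_y / n     # pairwise centered cross-product
        var_x = sum_xx - sum_x ** 2 / n
        var_y = sum_yy - sum_y ** 2 / n
        return cov / np.sqrt(var_x * var_y)

# Same data as the dataframe above, with one row per column A, B, C, D.
nan = np.nan
data = np.array([[2, nan, 1, -4, nan, nan, 3],
                 [nan, 1, nan, nan, 1, 3, nan],
                 [2, 1, nan, 2, 2.1, 1, 0],
                 [-2, 1.1, 3.2, 2, nan, 1, nan]])
rho_np = pairwise_nancorr(data)
print(np.round(rho_np, 6))
```

The common factor 1/(n-1) cancels between the covariance and the two standard deviations for each pair, so only the raw sums are needed. The cost is a handful of matrix products instead of a loop over all pairs.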
