Pearson correlation and NaN values

Question

I have two CSV files with hundreds of columns each, and I want to calculate the Pearson correlation coefficient and p-value for each pair of matching columns in the two files. The problem is that when a column contains missing data (NaN), I get an error. When I remove the NaN values with .dropna, the shapes of X and Y are sometimes unequal (because different rows were dropped from each column), and I receive this error:

"ValueError: operands could not be broadcast together with shapes (1020,) (1016,)"

Question: If row #8 in one CSV is NaN, is there a way to remove the same row from the other CSV as well, and run the analysis for every column using only the rows that have values in both files?

import numpy as np
import pandas as pd
from scipy import stats

df = pd.read_csv("D:/Insitu-Daily.csv", header=None)
dg = pd.read_csv("D:/Model-Daily.csv", header=None)

pearson_corr_set = []
pearson_p_set = []

for i in range(1, df.shape[1]):
    # NaNs are dropped from each column independently, so X and Y can differ in length
    X = df[i].dropna(axis=0, how='any')
    Y = dg[i].dropna(axis=0, how='any')

    pearson_corr, pearson_p = stats.pearsonr(X, Y)
    pearson_corr_set = np.append(pearson_corr_set, pearson_corr)
    pearson_p_set = np.append(pearson_p_set, pearson_p)

# Write correlations on the first line and p-values on the second
with open('D:/Results.csv', 'w') as file:
    str1 = ",".join(str(i) for i in np.asarray(pearson_corr_set))
    file.write(str1)
    file.write('\n')
    str1 = ",".join(str(i) for i in np.asarray(pearson_p_set))
    file.write(str1)
    file.write('\n')

Recommended answer

Here is one solution. First find the "bad" indices for your two NumPy arrays, meaning the positions where either array holds NaN. Then mask both arrays with the same boolean array so those positions are ignored and the two results stay the same length.

import numpy as np

x = np.array([5, 1, 6, 9, 10, np.nan, 1, 1, np.nan])
y = np.array([4, 4, 5, np.nan, 6, 2, 1, 8, 1])

# A position is "bad" if either array is NaN there; invert to get the positions to keep
good = ~np.logical_or(np.isnan(x), np.isnan(y))

np.compress(good, x)  # array([  5.,   1.,   6.,  10.,   1.,   1.])
np.compress(good, y)  # array([ 4.,  4.,  5.,  6.,  1.,  8.])
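To connect this back to the question, the same masking idea can be applied column by column to the two DataFrames. The following is only a sketch under some assumptions: both frames have the same number of rows in the same order, every compared column is numeric, and the helper name `columnwise_pearson` and its `min_pairs` threshold are hypothetical choices for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd
from scipy import stats


def columnwise_pearson(df, dg, min_pairs=3):
    """Pearson r and p-value per shared column, using only rows valid in both frames.

    Hypothetical helper: assumes df and dg are row-aligned and numeric.
    """
    rows = []
    for col in df.columns.intersection(dg.columns):
        x = df[col].to_numpy(dtype=float)
        y = dg[col].to_numpy(dtype=float)

        # Keep a row only if neither frame has NaN in this column
        good = ~(np.isnan(x) | np.isnan(y))
        n = int(good.sum())

        if n < min_pairs:
            # Too few paired values to report a meaningful correlation
            rows.append((col, np.nan, np.nan, n))
            continue

        r, p = stats.pearsonr(x[good], y[good])
        rows.append((col, r, p, n))

    return pd.DataFrame(rows, columns=["column", "pearson_r", "p_value", "n_pairs"])


# Small made-up frames; column 0 reuses the x and y arrays from the answer
df = pd.DataFrame({
    0: [5, 1, 6, 9, 10, np.nan, 1, 1, np.nan],
    1: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0],
})
dg = pd.DataFrame({
    0: [4, 4, 5, np.nan, 6, 2, 1, 8, 1],
    1: [2.0, 4.0, np.nan, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0],
})

result = columnwise_pearson(df, dg)
print(result)
```

With the files from the question, a non-numeric first column (the original loop starts at column 1) would need to be excluded first, for example by passing `df.iloc[:, 1:]` and `dg.iloc[:, 1:]`. The returned frame can then be saved with `result.to_csv(...)`. Reporting `n_pairs` alongside each coefficient makes it visible how many rows survived the masking for each column.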
