Python SciPy Spearman correlation for a matrix does not match two-array correlation nor pandas.DataFrame.corr()

Question

I was computing Spearman correlations for a matrix. I found that the matrix input and the two-array input gave different results when using scipy.stats.spearmanr. The results also differ from those of pandas.DataFrame.corr.

from scipy.stats import spearmanr # scipy 1.0.1
import pandas as pd # 0.22.0
import numpy as np
#Data 
X = pd.DataFrame({"A":[-0.4,1,12,78,84,26,0,0], "B":[-0.4,3.3,54,87,25,np.nan,0,1.2], "C":[np.nan,56,78,0,np.nan,143,11,np.nan], "D":[0,-9.3,23,72,np.nan,-2,-0.3,-0.4], "E":[78,np.nan,np.nan,0,-1,-11,1,323]})
matrix_rho_scipy = spearmanr(X,nan_policy='omit',axis=0)[0]
matrix_rho_pandas = X.corr('spearman')
print(matrix_rho_scipy == matrix_rho_pandas.values) # All False except diagonal
print(spearmanr(X['A'],X['B'],nan_policy='omit',axis=0)[0]) # 0.8839285714285714 from scipy 1.0.1
print(spearmanr(X['A'],X['B'],nan_policy='omit',axis=0)[0]) # 0.8829187134416477 from scipy 1.1.0
print(matrix_rho_scipy[0,1]) # 0.8263621207201486
print(matrix_rho_pandas.values[0,1]) # 0.8829187134416477

Later I found that pandas's rho is the same as R's rho.

X = data.frame(A=c(-0.4,1,12,78,84,26,0,0), 
  B=c(-0.4,3.3,54,87,25,NaN,0,1.2), C=c(NaN,56,78,0,NaN, 143,11,NaN), 
  D=c(0,-9.3,23,72,NaN,-2,-0.3,-0.4), E=c(78,NaN,NaN,0,-1,-11,1,323)) 
cor.test(X$A,X$B,method='spearman', exact = FALSE, na.action="na.omit") # 0.8829187 

However, pandas's corr doesn't work with large tables (e.g., here; my case is 16,000).

Thanks to Warren Weckesser's testing, I found that the two-array results from SciPy 1.1.0 (but not 1.0.1) match the results from pandas and R.

Please let me know if you have any suggestions or comments. Thank you.

I use Python 3.6.2 (Anaconda) on Mac OS 10.10.5.

Recommended answer

It appears that scipy.stats.spearmanr doesn't handle nan values as expected when the input is an array and an axis is given. Here's a script that compares a few methods of computing pairwise Spearman rank-order correlations:

import numpy as np
import pandas as pd
from scipy.stats import spearmanr


x = np.array([[np.nan,    3.0, 4.0, 5.0, 5.1, 6.0, 9.2],
              [5.0,    np.nan, 4.1, 4.8, 4.9, 5.0, 4.1],
              [0.5,       4.0, 7.1, 3.8, 8.0, 5.1, 7.6]])

r = spearmanr(x, nan_policy='omit', axis=1)[0]
print("spearmanr, array:           %11.7f %11.7f %11.7f" % (r[0, 1], r[0, 2], r[1, 2]))

r01 = spearmanr(x[0], x[1], nan_policy='omit')[0]
r02 = spearmanr(x[0], x[2], nan_policy='omit')[0]
r12 = spearmanr(x[1], x[2], nan_policy='omit')[0]

print("spearmanr, individual:      %11.7f %11.7f %11.7f" % (r01, r02, r12))

df = pd.DataFrame(x.T)
c = df.corr('spearman')

print("Pandas df.corr('spearman'): %11.7f %11.7f %11.7f" % (c[0][1], c[0][2], c[1][2]))
print("R cor.test:                   0.2051957   0.4857143  -0.4707919")
print('  (method="spearman", continuity=FALSE)')

"""
# R code:
> x0 = c(NA, 3, 4, 5, 5.1, 6.0, 9.2)
> x1 = c(5.0, NA, 4.1, 4.8, 4.9, 5.0, 4.1)
> x2 = c(0.5, 4.0, 7.1, 3.8, 8.0, 5.1, 7.6)
> cor.test(x0, x1, method="spearman", continuity=FALSE)
> cor.test(x0, x2, method="spearman", continuity=FALSE)
> cor.test(x1, x2, method="spearman", continuity=FALSE)
"""

Output:

spearmanr, array:            -0.0727393  -0.0714286  -0.4728054
spearmanr, individual:        0.2051957   0.4857143  -0.4707919
Pandas df.corr('spearman'):   0.2051957   0.4857143  -0.4707919
R cor.test:                   0.2051957   0.4857143  -0.4707919
  (method="spearman", continuity=FALSE)

My suggestion is to not use scipy.stats.spearmanr in the form spearmanr(x, nan_policy='omit', axis=<whatever>). Instead, use the corr() method of the pandas DataFrame, or use a loop to compute the values pairwise with spearmanr(x0, x1, nan_policy='omit').
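The loop suggested above could look roughly like the following. This is only a sketch: the helper name pairwise_spearman is made up for illustration, it assumes the variables are in columns (as in the question's DataFrame), and it fixes the diagonal at 1.0 by convention rather than computing it.

```python
import itertools

import numpy as np
from scipy.stats import spearmanr


def pairwise_spearman(data):
    """Spearman rho for every pair of columns, omitting NaNs pair by pair.

    Hypothetical helper, not part of SciPy or pandas.
    """
    data = np.asarray(data, dtype=float)
    n_vars = data.shape[1]
    rho = np.eye(n_vars)  # assumption: diagonal set to 1.0, not computed
    for i, j in itertools.combinations(range(n_vars), 2):
        # Two-array form: NaNs are omitted for this pair only.
        r, _ = spearmanr(data[:, i], data[:, j], nan_policy='omit')
        rho[i, j] = rho[j, i] = float(r)
    return rho


# Same data as the script above; transposed so each variable is a column.
x = np.array([[np.nan,    3.0, 4.0, 5.0, 5.1, 6.0, 9.2],
              [5.0,    np.nan, 4.1, 4.8, 4.9, 5.0, 4.1],
              [0.5,       4.0, 7.1, 3.8, 8.0, 5.1, 7.6]])

rho = pairwise_spearman(x.T)
print(rho)
```

With SciPy 1.1.0 or later, the off-diagonal entries should agree with the "spearmanr, individual" row of the output above. The loop makes one spearmanr call per pair of columns, so its cost grows quadratically with the number of columns.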
