Python Scipy spearman correlation for matrix does not match two-array correlation nor pandas.DataFrame.corr()
Question
I was computing Spearman correlations for a matrix. When using scipy.stats.spearmanr, I found that the matrix input and the two-array input gave different results. The results also differ from pandas.DataFrame.corr.
from scipy.stats import spearmanr # scipy 1.0.1
import pandas as pd # 0.22.0
import numpy as np
#Data
X = pd.DataFrame({"A":[-0.4,1,12,78,84,26,0,0], "B":[-0.4,3.3,54,87,25,np.nan,0,1.2], "C":[np.nan,56,78,0,np.nan,143,11,np.nan], "D":[0,-9.3,23,72,np.nan,-2,-0.3,-0.4], "E":[78,np.nan,np.nan,0,-1,-11,1,323]})
matrix_rho_scipy = spearmanr(X,nan_policy='omit',axis=0)[0]
matrix_rho_pandas = X.corr('spearman')
print(matrix_rho_scipy == matrix_rho_pandas.values) # All False except diagonal
print(spearmanr(X['A'],X['B'],nan_policy='omit',axis=0)[0]) # 0.8839285714285714 from scipy 1.0.1
print(spearmanr(X['A'],X['B'],nan_policy='omit',axis=0)[0]) # 0.8829187134416477 from scipy 1.1.0
print(matrix_rho_scipy[0,1]) # 0.8263621207201486
print(matrix_rho_pandas.values[0,1]) # 0.8829187134416477
Later I found that Pandas's rho is the same as R's rho.
X = data.frame(A=c(-0.4,1,12,78,84,26,0,0),
B=c(-0.4,3.3,54,87,25,NaN,0,1.2), C=c(NaN,56,78,0,NaN, 143,11,NaN),
D=c(0,-9.3,23,72,NaN,-2,-0.3,-0.4), E=c(78,NaN,NaN,0,-1,-11,1,323))
cor.test(X$A,X$B,method='spearman', exact = FALSE, na.action="na.omit") # 0.8829187
However, Pandas's corr doesn't work with large tables (e.g., here, and my case is 16,000).
Thanks to Warren Weckesser's testing, I found that the two-array results from Scipy 1.1.0 (but not 1.0.1) are the same as the results from Pandas and R.
Please let me know if you have any suggestions or comments. Thank you.
I use Python: 3.6.2 (Anaconda); Mac OS: 10.10.5.
Recommended answer
It appears that scipy.stats.spearmanr doesn't handle nan values as expected when the input is an array and an axis is given. Here's a script that compares a few methods of computing pairwise Spearman rank-order correlations:
import numpy as np
import pandas as pd
from scipy.stats import spearmanr
x = np.array([[np.nan, 3.0, 4.0, 5.0, 5.1, 6.0, 9.2],
[5.0, np.nan, 4.1, 4.8, 4.9, 5.0, 4.1],
[0.5, 4.0, 7.1, 3.8, 8.0, 5.1, 7.6]])
r = spearmanr(x, nan_policy='omit', axis=1)[0]
print("spearmanr, array: %11.7f %11.7f %11.7f" % (r[0, 1], r[0, 2], r[1, 2]))
r01 = spearmanr(x[0], x[1], nan_policy='omit')[0]
r02 = spearmanr(x[0], x[2], nan_policy='omit')[0]
r12 = spearmanr(x[1], x[2], nan_policy='omit')[0]
print("spearmanr, individual: %11.7f %11.7f %11.7f" % (r01, r02, r12))
df = pd.DataFrame(x.T)
c = df.corr('spearman')
print("Pandas df.corr('spearman'): %11.7f %11.7f %11.7f" % (c[0][1], c[0][2], c[1][2]))
print("R cor.test: 0.2051957 0.4857143 -0.4707919")
print(' (method="spearman", continuity=FALSE)')
"""
# R code:
> x0 = c(NA, 3, 4, 5, 5.1, 6.0, 9.2)
> x1 = c(5.0, NA, 4.1, 4.8, 4.9, 5.0, 4.1)
> x2 = c(0.5, 4.0, 7.1, 3.8, 8.0, 5.1, 7.6)
> cor.test(x0, x1, method="spearman", continuity=FALSE)
> cor.test(x0, x2, method="spearman", continuity=FALSE)
> cor.test(x1, x2, method="spearman", continuity=FALSE)
"""
Output:
spearmanr, array: -0.0727393 -0.0714286 -0.4728054
spearmanr, individual: 0.2051957 0.4857143 -0.4707919
Pandas df.corr('spearman'): 0.2051957 0.4857143 -0.4707919
R cor.test: 0.2051957 0.4857143 -0.4707919
(method="spearman", continuity=FALSE)
My suggestion is to not use scipy.stats.spearmanr in the form spearmanr(x, nan_policy='omit', axis=<whatever>). Use the corr() method of the Pandas DataFrame, or use a loop to compute the values pairwise using spearmanr(x0, x1, nan_policy='omit').
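The loop suggested above could look something like the following. This is only a sketch: the helper name pairwise_spearman is made up for illustration, and it assumes the variables are stored in columns. As a variation on the suggestion, it drops the rows with a missing value in either column itself before calling spearmanr, instead of passing nan_policy='omit', so the result should not depend on how a particular SciPy version handles that option.

```python
import itertools

import numpy as np
from scipy.stats import spearmanr


def pairwise_spearman(data):
    """Spearman rho for every pair of columns, dropping NaNs pair by pair.

    Hypothetical helper (not part of SciPy or Pandas). Assumes `data` is
    2-D with one variable per column.
    """
    data = np.asarray(data, dtype=float)
    n_cols = data.shape[1]
    # Start from the identity: each column is taken to correlate
    # perfectly with itself.
    rho = np.eye(n_cols)
    for i, j in itertools.combinations(range(n_cols), 2):
        # Keep only the rows where both columns are observed.
        keep = ~np.isnan(data[:, i]) & ~np.isnan(data[:, j])
        if keep.sum() < 2:
            # Too few complete pairs to define a correlation.
            rho[i, j] = rho[j, i] = np.nan
            continue
        rho[i, j] = rho[j, i] = spearmanr(data[keep, i], data[keep, j])[0]
    return rho
```

With the array x from the script above, pairwise_spearman(x.T) should agree with the "individual", Pandas and R rows of the output. For the DataFrame in the question it would be called as pairwise_spearman(X.values). Note that the loop makes one spearmanr call per pair of columns, so it may be slow for a table with many columns.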