pandas columns correlation with statistical significance
Question
What is the best way, given a pandas dataframe, df, to get the correlation between its columns df.1 and df.2?
I want rows containing NaN to be left out of the calculation, which the pandas built-in correlation already does. But I also want the output to include a p-value or a standard error, which the built-in does not provide.
SciPy seems to get tripped up by the NaNs, though I believe it does report significance.
Data example:
     1    2
0    2  NaN
1  NaN    1
2    1    2
3   -4    3
4  1.3    1
5  NaN  NaN
Recommended answer
The answer provided by @Shashank is nice. However, if you want a solution in pure pandas, you may like this:
import pandas as pd
import scipy.stats as stats
from datetime import datetime
# Note: pandas.io.data and pd.ols exist only in old pandas releases (both were later removed).
from pandas.io.data import DataReader

gdp = pd.DataFrame(DataReader("GDP", "fred", start=datetime(1990, 1, 1)))
vix = pd.DataFrame(DataReader("VIXCLS", "fred", start=datetime(1990, 1, 1)))

# Do it with a pandas regression to get the p-value from the F-test
df = gdp.merge(vix, left_index=True, right_index=True, how='left')
vix_on_gdp = pd.ols(y=df['VIXCLS'], x=df['GDP'], intercept=True)
print(df['VIXCLS'].corr(df['GDP']), vix_on_gdp.f_stat['p-value'])
Result:
-0.0422917932738 0.851762475093
This gives the same result as the scipy.stats function:
#Do it with stats functions.
df_clean = df.dropna()
stats.pearsonr(df_clean['VIXCLS'], df_clean['GDP'])
Result:
(-0.042291793273791969, 0.85176247509284908)
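For the small data example in the question, the same drop-then-pearsonr idea can be wrapped in a helper. This is only a sketch: the function name corr_with_significance is made up for illustration, and the standard error shown is the one used in the usual t-test for r, sqrt((1 - r^2) / (n - 2)), which needs more than two complete rows.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Sample data from the question; the column labels are the integers 1 and 2.
df = pd.DataFrame({
    1: [2, np.nan, 1, -4, 1.3, np.nan],
    2: [np.nan, 1, 2, 3, 1, np.nan],
})

def corr_with_significance(frame, col_a, col_b):
    """Hypothetical helper: Pearson r, two-sided p-value, standard error of r,
    and the number of rows used (rows with NaN in either column are dropped)."""
    pair = frame[[col_a, col_b]].dropna()
    n = len(pair)
    r, p = stats.pearsonr(pair[col_a], pair[col_b])
    # Standard error behind the t-test for r; only defined for n > 2.
    se = np.sqrt((1.0 - r ** 2) / (n - 2))
    return r, p, se, n

r, p, se, n = corr_with_significance(df, 1, 2)
print(r, p, se, n)
```

Only three rows of the sample have both values present, so the p-value here is based on a very small sample.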
To extend this to more variables, here is an ugly loop-based approach:
import numpy as np

# Add a third field
oil = pd.DataFrame(DataReader("DCOILWTICO", "fred", start=datetime(1990, 1, 1)))
df = df.merge(oil, left_index=True, right_index=True, how='left')

# Construct two arrays, one of the correlations and the other of the p-values
rho = df.corr()
pval = np.zeros([df.shape[1], df.shape[1]])
for i in range(df.shape[1]):  # one row and one column of pval per DataFrame column
    for j in range(df.shape[1]):
        JonI = pd.ols(y=df.iloc[:, i], x=df.iloc[:, j], intercept=True)
        pval[i, j] = JonI.f_stat['p-value']
Result for rho:
                 GDP    VIXCLS  DCOILWTICO
GDP         1.000000 -0.042292    0.870251
VIXCLS     -0.042292  1.000000   -0.004612
DCOILWTICO  0.870251 -0.004612    1.000000
Result for pval:
[[ 0.00000000e+00 8.51762475e-01 1.11022302e-16]
[ 8.51762475e-01 0.00000000e+00 9.83747425e-01]
[ 1.11022302e-16 9.83747425e-01 0.00000000e+00]]
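Since pd.ols is not available in later pandas releases, the same pair of matrices can be sketched with scipy.stats.pearsonr instead. This is an alternative to the regression loop, not the method used above: the function name corr_pvalues and the random demo data are assumptions for illustration, and the FRED series are not downloaded here.

```python
import numpy as np
import pandas as pd
from scipy import stats

def corr_pvalues(frame):
    """Hypothetical helper: return (rho, pval) as DataFrames.
    Each pair of columns uses only the rows where both values are present."""
    cols = frame.columns
    rho = frame.corr()  # pandas excludes NaN pairwise by default
    pval = pd.DataFrame(np.zeros((len(cols), len(cols))), index=cols, columns=cols)
    for a in cols:
        for b in cols:
            if a == b:
                continue  # leave the diagonal at 0, as in the output above
            pair = frame[[a, b]].dropna()
            _, p = stats.pearsonr(pair[a], pair[b])
            pval.loc[a, b] = p
    return rho, pval

# Made-up demo data: y tracks x closely, z is independent noise.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
demo = pd.DataFrame({
    "x": x,
    "y": x + rng.normal(scale=0.1, size=50),
    "z": rng.normal(size=50),
})
demo.iloc[3, 0] = np.nan  # scatter a couple of missing values
demo.iloc[7, 1] = np.nan

rho, pval = corr_pvalues(demo)
print(rho)
print(pval)
```

Returning pval as a DataFrame keeps the row and column labels aligned with rho, which the bare NumPy array in the loop above does not.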