pandas columns correlation with statistical significance

Views: 113 Posted: 2020/5/23 22:16:54 Tags: python, pandas, scipy, correlation

Question

Given a pandas DataFrame, df, what is the best way to get the correlation between its columns df.1 and df.2?

I want rows containing NaN to be left out of the calculation, which the pandas built-in correlation already does. But I also want the output to include a p-value or a standard error, which the built-in does not provide.

SciPy seems to get tripped up by the NaNs, though I believe it does report significance.

Data example:

     1           2
0    2          NaN
1    NaN         1
2    1           2
3    -4          3
4    1.3         1
5    NaN         NaN
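
For reference, the sample above can be rebuilt as a small DataFrame to check both behaviours described in the question. This is only a sketch: the column labels "1" and "2" are assumed to be strings, and dropping incomplete rows before calling scipy.stats.pearsonr is one possible workaround, not necessarily what the asker had in mind.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Rebuild the sample data; the string column labels are an assumption.
df = pd.DataFrame({
    "1": [2, np.nan, 1, -4, 1.3, np.nan],
    "2": [np.nan, 1, 2, 3, 1, np.nan],
})

# The pandas built-in uses only rows where both columns are present.
r_builtin = df["1"].corr(df["2"])

# SciPy needs complete rows, so drop any row containing NaN first.
complete = df.dropna()
r_clean, p_value = stats.pearsonr(complete["1"], complete["2"])

print(len(complete), r_builtin, r_clean, p_value)
```

With only three complete rows the p-value carries little information, but the two correlation coefficients should agree.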

Recommended answer

The answer provided by @Shashank is nice. However, if you want a solution in pure pandas, you may like this:

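import pandas as pd
# Legacy API: pandas.io.data was removed from later pandas releases;
# DataReader is now provided by the separate pandas-datareader package.
from pandas.io.data import DataReader
from datetime import datetime
import scipy.stats as stats

# Download the GDP and VIX series from FRED, starting in 1990
gdp = pd.DataFrame(DataReader("GDP", "fred", start=datetime(1990, 1, 1)))
vix = pd.DataFrame(DataReader("VIXCLS", "fred", start=datetime(1990, 1, 1)))

# Do it with a pandas regression to get the p-value from the F-test.
# Legacy API: pd.ols was also removed from later pandas releases.
df = gdp.merge(vix, left_index=True, right_index=True, how='left')
vix_on_gdp = pd.ols(y=df['VIXCLS'], x=df['GDP'], intercept=True)
print(df['VIXCLS'].corr(df['GDP']), vix_on_gdp.f_stat['p-value'])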

Result:

-0.0422917932738 0.851762475093

This matches the result from the SciPy stats function:

#Do it with stats functions. 
df_clean = df.dropna()
stats.pearsonr(df_clean['VIXCLS'], df_clean['GDP'])

Result:

  (-0.042291793273791969, 0.85176247509284908)
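
The two p-values agree because, in a regression with a single regressor and an intercept, the F statistic is a function of the correlation coefficient: F = r^2 (n - 2) / (1 - r^2), with 1 and n - 2 degrees of freedom. The sketch below checks this on synthetic data (the series, seed, and NaN positions are made up for illustration) using only the pandas correlation and the F distribution from scipy.stats, so it does not depend on pd.ols.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic, illustrative data with a few missing values.
rng = np.random.default_rng(1)
x = pd.Series(rng.normal(size=40))
y = 0.3 * x + pd.Series(rng.normal(size=40))
x.iloc[[3, 10]] = np.nan
y.iloc[[5]] = np.nan

# pandas ignores pairs where either value is missing.
r = x.corr(y)
n = int((x.notna() & y.notna()).sum())  # number of complete pairs

# F statistic of the one-regressor regression, written in terms of r.
f_stat = r**2 * (n - 2) / (1 - r**2)
p_from_f = stats.f.sf(f_stat, 1, n - 2)

# Reference value: pearsonr on the rows without NaN.
clean = pd.concat([x, y], axis=1).dropna()
r_scipy, p_scipy = stats.pearsonr(clean[0], clean[1])

print(r, p_from_f)
print(r_scipy, p_scipy)
```

Both lines should print the same correlation and, up to floating-point error, the same p-value.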

To extend this to more variables, here is an ugly loop-based approach:

import numpy as np

# Add a third field
oil = pd.DataFrame(DataReader("DCOILWTICO", "fred", start=datetime(1990, 1, 1)))
df = df.merge(oil, left_index=True, right_index=True, how='left')

# Construct two arrays, one of the correlations and the other of the p-values
rho = df.corr()
pval = np.zeros([df.shape[1], df.shape[1]])
for i in range(df.shape[1]):  # df.shape[1] is the number of columns, i.e. the size of the matrix
    for j in range(df.shape[1]):
        # Regress column i on column j (legacy pd.ols) and keep the F-test p-value
        JonI = pd.ols(y=df.iloc[:, i], x=df.iloc[:, j], intercept=True)
        pval[i, j] = JonI.f_stat['p-value']

Result for rho:

                 GDP    VIXCLS  DCOILWTICO
 GDP         1.000000 -0.042292    0.870251
 VIXCLS     -0.042292  1.000000   -0.004612
 DCOILWTICO  0.870251 -0.004612    1.000000

Result for pval:

 [[  0.00000000e+00   8.51762475e-01   1.11022302e-16]
  [  8.51762475e-01   0.00000000e+00   9.83747425e-01]
  [  1.11022302e-16   9.83747425e-01   0.00000000e+00]]
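
On pandas versions where pd.ols is no longer available, the same pair of matrices can be sketched with scipy.stats.pearsonr, dropping incomplete rows pair by pair so that the correlations match df.corr(). The helper name corr_with_pvalues and the demo data below are hypothetical and only for illustration; this is one possible substitute, not the method used in the answer.

```python
import numpy as np
import pandas as pd
from scipy import stats

def corr_with_pvalues(frame):
    """Return (rho, pval): pairwise Pearson r and two-sided p-values.

    Hypothetical helper: rows with NaN are dropped separately for each
    pair of columns, mirroring what DataFrame.corr does.
    """
    cols = frame.columns
    size = len(cols)
    rho = pd.DataFrame(np.eye(size), index=cols, columns=cols)
    pval = pd.DataFrame(np.zeros((size, size)), index=cols, columns=cols)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            pair = frame[[a, b]].dropna()
            r, p = stats.pearsonr(pair[a], pair[b])
            rho.loc[a, b] = rho.loc[b, a] = r
            pval.loc[a, b] = pval.loc[b, a] = p
    return rho, pval

# Illustrative data with missing values in two columns.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])
demo.iloc[::7, 0] = np.nan
demo.iloc[::11, 2] = np.nan

rho, pval = corr_with_pvalues(demo)
print(rho)
print(pval)
```

The diagonal is filled in directly (r = 1, p = 0) rather than computed, and only the upper triangle is evaluated because both matrices are symmetric.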
