Correlation between columns in DataFrame

Question

I'm pretty new to pandas, so I guess I'm doing something wrong -

I have a DataFrame:

     a     b
0  0.5  0.75
1  0.5  0.75
2  0.5  0.75
3  0.5  0.75
4  0.5  0.75

df.corr() gives me:

    a   b
a NaN NaN
b NaN NaN

But np.correlate(df["a"], df["b"]) gives: 1.875

Why is that? I want to have the correlation matrix for my DataFrame and thought that corr() does that (at least according to the documentation). Why does it return NaN?

What's the correct way to calculate?

Thanks a lot!

Recommended answer

np.correlate calculates the (unnormalized) cross-correlation between two 1-dimensional sequences:

z[k] = sum_n a[n] * conj(v[n+k])

while df.corr (by default) calculates the Pearson correlation coefficient.

The correlation coefficient (if it exists) is always between -1 and 1 inclusive. The cross-correlation is not bounded.
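As a rough illustration of the difference in scale, the sketch below compares the two quantities before and after multiplying both inputs by 10. The arrays `x` and `y` are made-up sample data, not taken from the question.

```python
import numpy as np

# Hypothetical sample data, chosen only for illustration
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 1.0, 4.0, 3.0])

# With the default mode='valid' and equal-length inputs, np.correlate
# returns a single value: the plain dot product of the two sequences.
raw = np.correlate(x, y)[0]
raw_scaled = np.correlate(10 * x, 10 * y)[0]

# Pearson's r does not change under positive rescaling and stays in [-1, 1].
r = np.corrcoef(x, y)[0, 1]
r_scaled = np.corrcoef(10 * x, 10 * y)[0, 1]

print(raw, raw_scaled)  # the cross-correlation grows by a factor of 100
print(r, r_scaled)      # the correlation coefficient is unchanged
```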

The formulas are somewhat related, but notice that in the cross-correlation formula (above) there is no subtraction of the means, and no division by the standard deviations which is part of the formula for Pearson correlation coefficient.
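To make the relationship concrete, here is a minimal sketch that applies the two missing steps (mean subtraction and division by the standard deviations) to the output of np.correlate and compares the result with np.corrcoef. The arrays `x` and `y` are made-up sample data.

```python
import numpy as np

# Hypothetical sample data, chosen only for illustration
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 1.0, 4.0, 3.0])

# Step 1: subtract the means
xc = x - x.mean()
yc = y - y.mean()

# Step 2: cross-correlate the centered series (a single dot product here),
# then divide by n times the product of the population standard deviations
n = len(x)
r_manual = np.correlate(xc, yc)[0] / (n * x.std() * y.std())

# Reference value computed by NumPy
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)
```

Note that `x.std()` uses `ddof=0` by default, which is why the divisor contains `n` rather than `n - 1`; the choice cancels out as long as it is applied consistently.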

The fact that the standard deviation of df['a'] and df['b'] is zero is what causes df.corr to be NaN everywhere.
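The following sketch rebuilds the DataFrame from the question to show both effects side by side: the standard deviations are zero, so the Pearson coefficient is undefined, while np.correlate simply sums the element-wise products.

```python
import numpy as np
import pandas as pd

# The constant-valued DataFrame from the question
df = pd.DataFrame({"a": [0.5] * 5, "b": [0.75] * 5})

# Both columns are constant, so their standard deviations are zero
stds = df.std()

# Pearson's formula would divide by zero, so pandas reports NaN
corr = df.corr()

# np.correlate just sums a[n] * b[n]: 5 * (0.5 * 0.75) = 1.875
raw = np.correlate(df["a"].to_numpy(), df["b"].to_numpy())

print(stds)
print(corr)
print(raw)
```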

From the comment below, it sounds like you are looking for Beta. It is related to Pearson's correlation coefficient, but instead of dividing by the product of standard deviations:

rho = cov(a, b) / (std(a) * std(b))

you divide by a variance:

beta = cov(a, b) / var(a)

You can compute Beta using np.cov:

cov = np.cov(a, b)
beta = cov[1, 0] / cov[0, 0]


import numpy as np
import matplotlib.pyplot as plt
np.random.seed(100)


def geometric_brownian_motion(T=1, N=100, mu=0.1, sigma=0.01, S0=20):
    """
    http://stackoverflow.com/a/13203189/190597 (unutbu)
    """
    dt = float(T) / N
    t = np.linspace(0, T, N)
    W = np.random.standard_normal(size=N)
    W = np.cumsum(W) * np.sqrt(dt)  # standard brownian motion ###
    X = (mu - 0.5 * sigma ** 2) * t + sigma * W
    S = S0 * np.exp(X)  # geometric brownian motion ###
    return S

N = 10 ** 6
a = geometric_brownian_motion(T=1, mu=0.1, sigma=0.01, N=N)
b = geometric_brownian_motion(T=1, mu=0.2, sigma=0.01, N=N)

cov = np.cov(a, b)
print(cov)
# [[ 0.38234755  0.80525967]
#  [ 0.80525967  1.73517501]]
beta = cov[1, 0] / cov[0, 0]
print(beta)
# 2.10609347015

plt.plot(a)
plt.plot(b)
plt.show()

The ratio of mus is 2, and beta is ~2.1.

And you could also compute it with df.corr, though this is a much more round-about way of doing it (but it is nice to see there is consistency):

import pandas as pd
df = pd.DataFrame({'a': a, 'b': b})
# .ix was removed in pandas 1.0; .iloc does the same positional lookup
beta2 = (df.corr() * df['b'].std() * df['a'].std() / df['a'].var()).iloc[0, 1]
print(beta2)
# 2.10609347015
assert np.allclose(beta, beta2)
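A more direct pandas route is to take the covariance of the two columns and divide by the variance, skipping the correlation matrix entirely. The sketch below is one way to do that with `Series.cov` and `Series.var`; the data is synthetic (a made-up linear relationship with slope 2 plus noise), not the Brownian-motion series above.

```python
import numpy as np
import pandas as pd

# Synthetic data, assumed only for illustration: b is roughly 2 * a
rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = 2.0 * a + rng.normal(scale=0.1, size=1000)
df = pd.DataFrame({"a": a, "b": b})

# Beta from the NumPy covariance matrix, as in the answer
cov = np.cov(a, b)
beta_np = cov[1, 0] / cov[0, 0]

# Same quantity computed directly from the pandas columns
beta_pd = df["a"].cov(df["b"]) / df["a"].var()

print(beta_np, beta_pd)
```

Both np.cov and the pandas methods default to the sample estimate (`ddof=1`), so the two values agree up to floating-point error.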
