Weighted correlation coefficient with pandas

Question

Is there any way to compute a weighted correlation coefficient with pandas? I saw that R has such a method. I'd also like to get the p-value of the correlation, which I could not find in R either. Wikipedia's explanation of weighted correlation: https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#Weighted_correlation_coefficient

Recommended answer

I don't know of any Python packages that implement this, but it should be fairly straightforward to roll your own implementation. Using the naming conventions of the Wikipedia article:

import numpy as np

def m(x, w):
    """Weighted Mean"""
    return np.sum(x * w) / np.sum(w)

def cov(x, y, w):
    """Weighted Covariance"""
    return np.sum(w * (x - m(x, w)) * (y - m(y, w))) / np.sum(w)

def corr(x, y, w):
    """Weighted Correlation"""
    return cov(x, y, w) / np.sqrt(cov(x, x, w) * cov(y, y, w))

I tried to make the functions above match the formulas in the Wikipedia article as closely as possible, but there is room for simplification and performance improvement. For example, as pointed out by @Alberto Garcia-Raboso, m(x, w) is really just np.average(x, weights=w), so there's no need to write a function for it.
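As a rough sketch of that kind of simplification, NumPy's own weighted routines can stand in for the hand-written helpers. The name `weighted_corr` and the random test data below are made up for illustration. The sketch assumes `np.cov` with `aweights` and `ddof=0`, which divides by the sum of the weights in the same way as the formulas above.

```python
import numpy as np

def weighted_corr(x, y, w):
    """Weighted Pearson correlation built on np.cov (illustrative helper)."""
    # aweights applies one weight per observation; ddof=0 divides by sum(w).
    c = np.cov(x, y, aweights=w, ddof=0)
    return c[0, 1] / np.sqrt(c[0, 0] * c[1, 1])

# Hypothetical data, only to exercise the function.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.5 * x + rng.normal(size=1000)
w = rng.random(size=1000)

# np.average with weights matches the hand-written weighted mean.
assert np.isclose(np.average(x, weights=w), np.sum(x * w) / np.sum(w))

r = weighted_corr(x, y, w)
```

Because the correlation is a ratio of covariances, the choice of `ddof` cancels out; `ddof=0` is used here only so the intermediate covariance matrix matches the definitions above.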

The functions are pretty bare-bones and only do the calculations. You may want to coerce the inputs to arrays first, e.g. x = np.asarray(x), because these functions will not work if plain lists are passed. Additional checks that all inputs have equal length, contain no null values, etc. could also be implemented.
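A minimal sketch of what such checks might look like follows. The function name `checked_corr`, the specific checks, and the error messages are illustrative choices, not part of the original answer.

```python
import numpy as np

def checked_corr(x, y, w):
    """Weighted correlation with basic input validation (illustrative)."""
    # Coerce lists, tuples, or pandas Series to float arrays.
    x, y, w = (np.asarray(a, dtype=float) for a in (x, y, w))
    if x.ndim != 1 or not (x.shape == y.shape == w.shape):
        raise ValueError("x, y and w must be 1-D and of equal length")
    if x.size == 0:
        raise ValueError("inputs must not be empty")
    if np.isnan(x).any() or np.isnan(y).any() or np.isnan(w).any():
        raise ValueError("inputs must not contain NaN")
    if (w < 0).any() or w.sum() <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    # Same formulas as above, written with np.average.
    mx, my = np.average(x, weights=w), np.average(y, weights=w)
    cov_xy = np.average((x - mx) * (y - my), weights=w)
    var_x = np.average((x - mx) ** 2, weights=w)
    var_y = np.average((y - my) ** 2, weights=w)
    return cov_xy / np.sqrt(var_x * var_y)

# Plain lists are accepted because of the coercion step.
r = checked_corr([1, 2, 3, 4], [2, 1, 4, 3], [1, 1, 2, 2])
```

Note that a constant `x` or `y` still yields a zero variance and therefore a division by zero; whether to raise or return NaN in that case is left open here.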

Example usage:

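import numpy as np
import pandas as pd

# Initialize a DataFrame with two discrete columns and random weights.
np.random.seed([3,1415])
n = 10**6
df = pd.DataFrame({
    'x': np.random.choice(3, size=n),
    'y': np.random.choice(4, size=n),
    'w': np.random.random(size=n)
    })

# Compute the weighted correlation using corr() as defined above.
r = corr(df['x'], df['y'], df['w'])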
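One way to gain some confidence in an implementation like this is to check two properties that follow from the formulas: with equal weights the result should reduce to the ordinary Pearson correlation, and multiplying all weights by a constant should not change it. The sketch below uses plain NumPy arrays in place of the DataFrame columns; the smaller sample size and the compact `corr` rewrite are assumptions made to keep it self-contained.

```python
import numpy as np

def corr(x, y, w):
    """Weighted correlation, same formulas as above via np.average."""
    mx, my = np.average(x, weights=w), np.average(y, weights=w)
    cov_xy = np.average((x - mx) * (y - my), weights=w)
    var_x = np.average((x - mx) ** 2, weights=w)
    var_y = np.average((y - my) ** 2, weights=w)
    return cov_xy / np.sqrt(var_x * var_y)

rng = np.random.default_rng(42)
n = 10_000
x = rng.integers(0, 3, size=n).astype(float)
y = rng.integers(0, 4, size=n).astype(float)
w = rng.random(size=n)

# Equal weights: should agree with the unweighted Pearson correlation.
assert np.isclose(corr(x, y, np.ones(n)), np.corrcoef(x, y)[0, 1])

# Rescaling the weights: the correlation should be unchanged.
assert np.isclose(corr(x, y, w), corr(x, y, 10.0 * w))
```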

There's a discussion here regarding the p-value. There doesn't appear to be a generic calculation; it depends on how the weights were actually obtained.
