Remove outliers from correlation coefficient calculation


Question

Assume we have two numeric vectors x and y. The Pearson correlation coefficient between x and y is given by

cor(x,y)

How can I automatically consider only a subset of x and y in the calculation (say 90%) so as to maximize the correlation coefficient?

Recommended answer

If you really want to do this (remove the observations with the largest (absolute) residuals), you can fit a linear model to obtain the least squares solution and its residuals, and then select the middle n% of the data. Here is an example:

Firstly, generate some dummy data:

library(MASS) ## for mvrnorm()
set.seed(1)
## symmetric covariance matrix: var(X) = 1, var(Y) = 0.8, cov(X, Y) = 0.8
Sigma <- matrix(c(1, 0.8, 0.8, 0.8), ncol = 2)
dat <- mvrnorm(1000, mu = c(4, 5), Sigma = Sigma)
dat <- data.frame(dat)
names(dat) <- c("X", "Y")
plot(dat)

Next, we fit the linear model and extract the residuals:

mod <- lm(Y ~ X, data = dat)
res <- resid(mod)

The quantile() function can give us the required quantiles of the residuals. You suggested retaining 90% of the data, so we want the 0.05 and 0.95 quantiles:

res.qt <- quantile(res, probs = c(0.05,0.95))

Select those observations with residuals in the middle 90% of the data:

want <- which(res >= res.qt[1] & res <= res.qt[2])

We can then visualise this, with the red points being those we will retain:

plot(dat, type = "n")
points(dat[-want,], col = "black", pch = 21, bg = "black", cex = 0.8)
points(dat[want,], col = "red", pch = 21, bg = "red", cex = 0.8)
abline(mod, col = "blue", lwd = 2)

[Figure: plot of the dummy data showing the selected points with the smallest residuals (https://i.stack.imgur.com/gaOp1.png)]

The correlations for the full data, the selected subset, and the discarded points are:

> cor(dat)
          X         Y
X 1.0000000 0.8935235
Y 0.8935235 1.0000000
> cor(dat[want,])
          X         Y
X 1.0000000 0.9272109
Y 0.9272109 1.0000000
> cor(dat[-want,])
         X        Y
X 1.000000 0.739972
Y 0.739972 1.000000
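
The steps above can be wrapped into a small helper so that the retained share is a parameter. This is only a sketch: the function name cor_middle_resid and its keep argument are made up for illustration, and it assumes x and y are complete numeric vectors of equal length with no missing values.

```r
# Hypothetical helper (not part of the original answer): correlation after
# keeping only the observations whose residuals from lm(y ~ x) fall in the
# central `keep` share of the residual distribution.
cor_middle_resid <- function(x, y, keep = 0.9) {
  stopifnot(length(x) == length(y), keep > 0, keep <= 1)
  res <- resid(lm(y ~ x))                    # least squares residuals
  tail_p <- (1 - keep) / 2                   # share trimmed from each tail
  qt <- quantile(res, probs = c(tail_p, 1 - tail_p))
  want <- res >= qt[1] & res <= qt[2]        # logical mask of retained rows
  list(cor = cor(x[want], y[want]), kept = which(want))
}

# Usage on simulated data (assumed for illustration only)
set.seed(42)
x <- rnorm(200)
y <- 2 * x + rnorm(200)
out <- cor_middle_resid(x, y, keep = 0.9)
cor(x, y)          # correlation on all observations
out$cor            # correlation on the retained subset
length(out$kept)   # number of observations retained
```

Returning the retained indices alongside the correlation makes it easy to inspect or plot exactly which points were dropped.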

Be aware that here we might be throwing out perfectly good data, because we simply drop the 5% of observations with the largest positive residuals and the 5% with the largest negative residuals. An alternative is to select the 90% with the smallest absolute residuals:

ares <- abs(res)
absres.qt <- quantile(ares, probs = 0.9)
abswant <- which(ares <= absres.qt)
## plot - virtually the same, but not quite
plot(dat, type = "n")
points(dat[-abswant,], col = "black", pch = 21, bg = "black", cex = 0.8)
points(dat[abswant,], col = "red", pch = 21, bg = "red", cex = 0.8)
abline(mod, col = "blue", lwd = 2)

With this slightly different subset, the correlation is slightly lower:

> cor(dat[abswant,])
          X         Y
X 1.0000000 0.9272032
Y 0.9272032 1.0000000
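
To see where the two selection rules disagree, the retained index sets can be compared with setdiff(). The sketch below is self-contained and uses deliberately skewed noise (an assumption made here purely so that the difference is visible); the variable names mirror those used above.

```r
# Compare the "middle 90% of residuals" rule with the
# "90% smallest absolute residuals" rule on simulated data.
set.seed(1)
x <- rnorm(500)
y <- x + rexp(500)                 # skewed noise, assumed for illustration
res <- resid(lm(y ~ x))

# Rule 1: trim 5% from each tail of the signed residuals
qt <- quantile(res, probs = c(0.05, 0.95))
want <- which(res >= qt[1] & res <= qt[2])

# Rule 2: keep the 90% of observations with the smallest |residual|
ares <- abs(res)
abswant <- which(ares <= quantile(ares, probs = 0.9))

length(want)                # size of the first subset
length(abswant)             # size of the second subset
setdiff(want, abswant)      # kept by rule 1 only
setdiff(abswant, want)      # kept by rule 2 only
```

With symmetric residuals the two subsets nearly coincide, which is why the two correlations above are so close; with skewed residuals they can differ more noticeably.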

Another point is that even then we are throwing out good data. You might want to look at Cook's distance as a measure of how influential each observation is, and discard only those observations whose Cook's distance exceeds a chosen threshold. Wikipedia has info on Cook's distance and proposed thresholds. The cooks.distance() function can be used to retrieve the values from mod:

> head(cooks.distance(mod))
           1            2            3            4            5            6 
7.738789e-04 6.056810e-04 6.375505e-04 4.338566e-04 1.163721e-05 1.740565e-03

If you compute the threshold(s) suggested on Wikipedia, you can remove only those observations that exceed them. For these data:

> any(cooks.distance(mod) > 1)
[1] FALSE

None of the Cook's distances exceed the threshold of 1 (not surprising given the way I generated the data). The other cut-off often quoted is 4/n, which is 0.004 for these 1000 observations and would be checked with:

> any(cooks.distance(mod) > 4 / nrow(dat))

That is a much stricter rule of thumb and can flag points even in well-behaved data, so observations exceeding it deserve a closer look rather than automatic removal.
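
As a rough sketch of the Cook's distance route, the following self-contained example plants a few gross outliers in simulated data (an assumption made only for illustration), flags observations above the 4/n rule of thumb, and compares the correlation with and without them. The choice of 4/n is one of several suggested cut-offs, not a definitive rule.

```r
# Flag influential observations by Cook's distance and recompute the correlation.
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
y[1:3] <- y[1:3] + 10              # planted outliers, for illustration only

mod <- lm(y ~ x)
cd <- cooks.distance(mod)          # one value per observation
threshold <- 4 / n                 # rule-of-thumb cut-off (an assumption here)

flagged <- which(cd > threshold)   # indices exceeding the cut-off
keep <- setdiff(seq_len(n), flagged)

flagged                            # which observations were flagged
cor(x, y)                          # correlation on all observations
cor(x[keep], y[keep])              # correlation without the flagged ones
```

Inspecting flagged before dropping anything is the point of this approach: only observations that are demonstrably influential are removed, rather than a fixed share of the data.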

Having said all of this, why do you want to do this? If you are just trying to get rid of data to improve a correlation or generate a significant relationship, that sounds a bit fishy and a bit like data dredging to me.
