Function to calculate R2 (R-squared) in R


Problem description


I have a dataframe with observed and modelled data, and I would like to calculate the R2 value. I expected there to be a function I could call for this, but can't locate one. I know I can write my own and apply it, but am I missing something obvious? I want something like

obs <- 1:5
mod <- c(0.8,2.4,2,3,4.8)
df <- data.frame(obs, mod)

R2 <- rsq(df)
# 0.85

Solution

You need a little statistical knowledge to see this. The R squared between two vectors is just the square of their correlation, so you can define your function as:

rsq <- function (x, y) cor(x, y) ^ 2
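
Applied to the data from the question, this gives the value of roughly 0.85 the question expects:

obs <- 1:5
mod <- c(0.8, 2.4, 2, 3, 4.8)
rsq(obs, mod)
#[1] 0.8560185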

Sandipan's answer will return exactly the same result (see "Let's do the statistics" below), but as it stands it appears more readable (thanks to the explicit $r.squared).


Let's do the statistics

Basically we fit a linear regression of y over x, and compute the ratio of regression sum of squares to total sum of squares.

lemma 1: a regression y ~ x is equivalent to y - mean(y) ~ x - mean(x)

lemma 2: beta = cov(x, y) / var(x)

lemma 3: R.square = cor(x, y) ^ 2
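
As a quick numerical check of these lemmas (a minimal sketch on made-up data, using base R's lm), the ratio of regression to total sum of squares reproduces cor(x, y) ^ 2 and what summary.lm reports:

set.seed(1)
x <- rnorm(20)
y <- 2 * x + rnorm(20)

fit   <- lm(y ~ x)                       # regression y ~ x
tss   <- sum((y - mean(y))^2)            # total sum of squares
regss <- sum((fitted(fit) - mean(y))^2)  # regression sum of squares

regss / tss             # R squared via the sum-of-squares partition
cor(x, y)^2             # lemma 3: the same value
summary(fit)$r.squared  # what summary.lm reports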


Warning

R squared between two arbitrary vectors x and y (of the same length) is just a measure of their linear relationship. Think twice!! The R squared between x + a and y + b is identical for any constant shifts a and b. So it is a weak or even useless measure of "goodness of prediction". Use MSE or RMSE instead.
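
A small illustration of this shift-invariance (rmse below is an ad-hoc helper written for this sketch, not a base R function):

set.seed(42)
x <- 1:10
y <- 2 * x + rnorm(10)

cor(x, y)^2
cor(x + 100, y - 5)^2    # identical: R squared ignores constant shifts

rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))
rmse(x, y)
rmse(x + 100, y)         # RMSE immediately exposes the shifted values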

I agree with 42-'s comment:

The R squared is reported by summary functions associated with regression functions. But only when such an estimate is statistically justified.

R squared can be a (but not the best) measure of "goodness of fit". But there is no justification that it can measure the goodness of out-of-sample prediction. If you split your data into training and testing parts and fit a regression model on the training part, you can get a valid R squared value on the training part, but you can't legitimately compute an R squared on the test part. Some people do this, but I don't agree with it.

Here is a very extreme example:

preds <- 1:4/4
actual <- 1:4

The R squared between these two vectors is 1. Of course, one is just a linear rescaling of the other, so they have a perfect linear relationship. But do you really think that preds is a good prediction of actual??
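
Checking the numbers makes the point concrete (RMSE shown for contrast):

cor(preds, actual)^2              # 1: a perfect linear relationship
sqrt(mean((preds - actual)^2))    # RMSE of about 2.05: the predictions are far off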


In reply to wordsforthewise

Thanks for your comments 1, 2 and your detailed answer.

You probably misunderstood the procedure. Given two vectors x and y, we first fit a regression line y ~ x and then compute the regression sum of squares and the total sum of squares. It looks like you skip this regression step and go straight to the sum-of-squares computation. That is wrong, because without the regression the partition of the sum of squares does not hold, and you can't compute R squared in a consistent way.

As you demonstrated, this is just one way of computing R squared:

preds <- c(1, 2, 3)
actual <- c(2, 2, 4)
rss <- sum((preds - actual) ^ 2)  ## residual sum of squares
tss <- sum((actual - mean(actual)) ^ 2)  ## total sum of squares
rsq <- 1 - rss/tss
#[1] 0.25

But there is another:

regss <- sum((preds - mean(preds)) ^ 2) ## regression sum of squares
regss / tss
#[1] 0.75
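
The two values disagree precisely because no regression was fitted. Once the regression step is actually performed, the partition holds and both formulas agree, matching cor(preds, actual) ^ 2; here is a minimal check with base R's lm:

fit <- lm(actual ~ preds)                       # the regression step
rss   <- sum(residuals(fit) ^ 2)                # residual sum of squares
regss <- sum((fitted(fit) - mean(actual)) ^ 2)  # regression sum of squares
tss   <- sum((actual - mean(actual)) ^ 2)       # total sum of squares

c(regss / tss, 1 - rss / tss, cor(preds, actual) ^ 2)
#[1] 0.75 0.75 0.75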

Also, your formula can give a negative value (while the proper value should be 1, as mentioned in the Warning section above).

preds <- 1:4 / 4
actual <- 1:4
rss <- sum((preds - actual) ^ 2)  ## residual sum of squares
tss <- sum((actual - mean(actual)) ^ 2)  ## total sum of squares
rsq <- 1 - rss/tss
#[1] -2.375


Final remark

When I posted my initial answer two years ago, I never expected that it would eventually grow this long. However, given the high view count of this thread, I feel obliged to add more statistical detail and discussion. I don't want to mislead people into thinking that, just because an R squared is so easy to compute, they can use it everywhere.
