R-squared on test data


Problem description

I fit a linear regression model on 75% of my data set, which includes ~11000 observations and 143 variables:

gl.fit <- lm(y[1:ceiling(length(y)*(3/4))] ~ ., data= x[1:ceiling(length(y)*(3/4)),]) #3/4 for training

and I got an R^2 of 0.43. I then tried predicting on my test data using the rest of the data:

ytest <- y[(ceiling(length(y)*(3/4))+1):length(y)]
x.test <- cbind(1, x[(ceiling(length(y)*(3/4))+1):length(y),])  # The rest for test
yhat <- as.matrix(x.test) %*% gl.fit$coefficients               # Calculate the predicted values

I would now like to calculate the R^2 value on my test data. Is there an easy way to do that?

Thanks

Recommended answer

There are a couple of problems here. First, this is not a good way to use lm(...). lm(...) is meant to be used with a data frame, with the formula referencing columns in that data frame. So, assuming your data is in two vectors x and y:

set.seed(1)    # for reproducible example
x <- 1:11000
y <- 3+0.1*x + rnorm(11000,sd=1000)

df <- data.frame(x,y)
# training set
train <- sample(1:nrow(df),0.75*nrow(df))   # random sample of 75% of data

fit <- lm(y~x,data=df[train,])

Now fit holds the model based on the training set. Using lm(...) this way lets you, for example, generate predictions without doing all the matrix multiplication.

The second problem is the definition of R-squared. The conventional definition is:

1 - SS.residual/SS.total

For the training set, and the training set ONLY,

SS.total = SS.regression + SS.residual

so that

SS.regression = SS.total - SS.residual,

and therefore

R.sq = SS.regression/SS.total

So R.sq is the fraction of variability in the dataset that is explained by the model, and it will always be between 0 and 1.

You can see this below.

SS.total      <- with(df[train,],sum((y-mean(y))^2))
SS.residual   <- sum(residuals(fit)^2)
SS.regression <- sum((fitted(fit)-mean(df[train,]$y))^2)
SS.total - (SS.regression+SS.residual)
# [1] 1.907349e-06
SS.regression/SS.total     # fraction of variation explained by the model
# [1] 0.08965502
1-SS.residual/SS.total     # same thing, for model frame ONLY!!! 
# [1] 0.08965502          
summary(fit)$r.squared     # both are = R.squared
# [1] 0.08965502

But this does not work with the test set (that is, when you make predictions from a model).

test <- -train
test.pred <- predict(fit,newdata=df[test,])
test.y    <- df[test,]$y

SS.total      <- sum((test.y - mean(test.y))^2)
SS.residual   <- sum((test.y - test.pred)^2)
SS.regression <- sum((test.pred - mean(test.y))^2)
SS.total - (SS.regression+SS.residual)
# [1] 8958890

# NOT the fraction of variability explained by the model
test.rsq <- 1 - SS.residual/SS.total  
test.rsq
# [1] 0.0924713

# fraction of variability explained by the model
SS.regression/SS.total 
# [1] 0.08956405

In this contrived example there is not much difference, but it is entirely possible to get an R-sq. value less than 0 (when defined this way).

If, for example, the model is a very poor predictor on the test set, then the residuals can actually be larger than the total variation in the test set. This is equivalent to saying that the test set is modeled better by its own mean than by the model derived from the training set.

I noticed that you use the first three quarters of your data as the training set, rather than taking a random sample (as in this example). If the dependence of y on x is non-linear, and the x's are in order, then you could get a negative R-sq with the test set.

Regarding OP's comment below, one way to assess the model with a test set is to compare the in-model and out-of-model mean squared error (MSE).

mse.train <- summary(fit)$sigma^2
mse.test  <- sum((test.pred - test.y)^2)/(nrow(df)-length(train)-2)

If we assume that the training and test sets are both normally distributed with the same variance, with means that follow the same model formula, then the ratio mse.train/mse.test should have an F-distribution with (n.train-2) and (n.test-2) degrees of freedom. If the MSEs are significantly different based on an F-test, then the model does not fit the test data well.

Have you plotted test.y and pred.y vs x? That alone will tell you a lot.
