scikit-learn & statsmodels - which R-squared is correct?


Problem description

I'd like to choose the best algorithm for the future. I found some solutions, but I don't know which R-squared value is correct.

For this, I split my data into training and test sets, and printed two different R-squared values below.

import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# x_train/x_test/y_train/y_test: the train/test split described above
lineer = LinearRegression()
lineer.fit(x_train,y_train)
lineerPredict = lineer.predict(x_test)

scoreLineer = r2_score(y_test, lineerPredict)  # First R-Squared

model = sm.OLS(lineerPredict, y_test)
print(model.fit().summary()) # Second R-Squared

The first R-squared result is -4.28.
The second R-squared result is 0.84.

But I don't know which value is correct.

Answer

Arguably, the real challenge in such cases is to be sure that you compare apples to apples. And in your case, it seems that you don't. Our best friend is always the relevant documentation, combined with simple experiments. So...

Although scikit-learn's LinearRegression() (i.e. your 1st R-squared) is fitted by default with fit_intercept=True (docs), this is not the case with statsmodels' OLS (your 2nd R-squared); quoting from the docs:

An intercept is not included by default and should be added by the user. See statsmodels.tools.add_constant.
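
To see concretely what this means: add_constant simply prepends a column of ones to the design matrix, and that column plays the role of the intercept term. A tiny illustration of my own (X_demo is just a hypothetical toy array):

import numpy as np
import statsmodels.api as sm

X_demo = np.arange(1, 4).reshape(-1, 1)  # column vector [[1], [2], [3]]
sm.add_constant(X_demo)
# array([[1., 1.],
#        [1., 2.],
#        [1., 3.]])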

Keeping this important detail in mind, let's run some simple experiments with dummy data:

import numpy as np
import statsmodels.api as sm
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression

# dummy data:
y = np.array([1,3,4,5,2,3,4])
X = np.array(range(1,8)).reshape(-1,1) # reshape to column

# scikit-learn:
lr = LinearRegression()
lr.fit(X,y)
# LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
#     normalize=False)

lr.score(X,y)
# 0.16118421052631582

y_pred=lr.predict(X)
r2_score(y, y_pred)
# 0.16118421052631582


# statsmodels
# first artificially add intercept to X, as advised in the docs:
X_ = sm.add_constant(X)

model = sm.OLS(y,X_) # X_ here
results = model.fit()
results.rsquared
# 0.16118421052631593

For all practical purposes, these two values of R-squared produced by scikit-learn and statsmodels are identical.
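
(The disagreement in the last couple of decimal places is just floating-point noise; if in doubt, a quick check along these lines confirms it:)

import numpy as np
np.isclose(lr.score(X, y), results.rsquared)
# True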

Let's go a step further, and try a scikit-learn model without intercept, but where we use the artificially "intercepted" data X_ we have already built for use with statsmodels:

lr2 = LinearRegression(fit_intercept=False)
lr2.fit(X_,y) # X_ here
# LinearRegression(copy_X=True, fit_intercept=False, n_jobs=None,
#         normalize=False)

lr2.score(X_, y)
# 0.16118421052631593

y_pred2 = lr2.predict(X_)
r2_score(y, y_pred2)
# 0.16118421052631593

Again, the R-squared is identical with the previous values.

So, what happens when we "accidentally" forget to account for the fact that statsmodels OLS is fitted without an intercept? Let's see:

model3 = sm.OLS(y,X) # X here, i.e. no intercept
results3 = model3.fit()
results3.rsquared
# 0.8058035714285714

Well, an R-squared of 0.80 is indeed very far from the 0.16 returned by a model with an intercept, and arguably this is exactly what has happened in your case.
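
As a side note on where that 0.80 comes from: according to the statsmodels docs, rsquared is defined as 1 - ssr/centered_tss if the model includes a constant, and 1 - ssr/uncentered_tss if it does not. Here is a manual check of that definition against the result above (a sketch of my own, reusing the dummy y and X):

import numpy as np

y = np.array([1,3,4,5,2,3,4])
X = np.array(range(1,8)).reshape(-1,1)

# no-intercept least-squares fit:
beta = np.linalg.lstsq(X, y, rcond=None)[0]
ss_res = ((y - X.dot(beta))**2).sum()
ss_tot_uncentered = (y**2).sum()  # note: no mean subtraction
1 - ss_res/ss_tot_uncentered
# ~0.80580357, matches results3.rsquared above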

So far so good, and I could easily finish the answer here; but there is indeed a point where this harmonious world breaks down: let's see what happens when we fit both models without an intercept and with the initial data X, to which we have not artificially added any intercept. We have already fitted the OLS model above and got an R-squared of 0.80; what about a similar model from scikit-learn?

# scikit-learn
lr3 = LinearRegression(fit_intercept=False)
lr3.fit(X,y) # X here
lr3.score(X,y)
# -0.4309210526315792

y_pred3 = lr3.predict(X)
r2_score(y, y_pred3)
# -0.4309210526315792

Oops...! What the heck?

It seems that scikit-learn, when computing the r2_score, always assumes an intercept, either explicit in the model (fit_intercept=True) or implicit in the data (the way we have produced X_ from X above, using statsmodels' add_constant); digging a little online reveals a Github thread (closed without a remedy) where it is confirmed that this is indeed the case.

Let me clarify that the discrepancy I have described above has nothing to do with your issue: in your case, the real issue is that you are actually comparing apples (a model with intercept) with oranges (a model without intercept).
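
If it helps, here is a minimal sketch of what an apples-to-apples version of your comparison could look like, assuming the x_train, x_test, y_train, y_test from your question: fit the same intercept-including model in both libraries on the training data, then score both with the same metric on the test set:

import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# scikit-learn (intercept included by default):
lineer = LinearRegression().fit(x_train, y_train)
score_sklearn = r2_score(y_test, lineer.predict(x_test))

# statsmodels (intercept added explicitly with add_constant):
ols = sm.OLS(y_train, sm.add_constant(x_train)).fit()
score_sm = r2_score(y_test, ols.predict(sm.add_constant(x_test)))

# same design, same metric, hence (practically) identical scores:
print(score_sklearn, score_sm)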

So, why does scikit-learn not only fail in such an (admittedly edge) case, but why, even when the fact emerges in a Github issue, is it actually treated with indifference? (Notice also that the scikit-learn core developer who replies in the above thread casually admits that "I'm not super familiar with stats"...)

The answer goes a little beyond coding issues, such as the ones SO is mainly about, but it may be worth elaborating a little here.

Arguably, the reason is that the whole R-squared concept comes in fact directly from the world of statistics, where the emphasis is on interpretative models, and it has little use in machine learning contexts, where the emphasis is clearly on predictive models; at least AFAIK, and beyond some very introductory courses, I have never (I mean never...) seen a predictive modeling problem where the R-squared is used for any kind of performance assessment; nor is it an accident that popular machine learning introductions, such as Andrew Ng's Machine Learning at Coursera, do not even bother to mention it. And, as noted in the Github thread above (emphasis added):

In particular when using a test set, it's a bit unclear to me what the R^2 means.

I certainly concur.

As for the edge case discussed above (whether or not to include an intercept term), I suspect it would sound really irrelevant to modern deep learning practitioners, where the equivalent of an intercept (the bias parameters) is always included by default in neural network models...

See the accepted (and highly upvoted) answer in the Cross Validated question Difference between statsmodel OLS and scikit linear regression for a more detailed discussion along these last lines...
