scikit-learn & statsmodels - which R-squared is correct?

Question

I'd like to choose the best algorithm for the future. I found some solutions, but I didn't understand which R-squared value is correct.

To do this, I split my data into training and test sets, and printed the two different R-squared values below.

import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

lineer = LinearRegression()
lineer.fit(x_train,y_train)
lineerPredict = lineer.predict(x_test)

scoreLineer = r2_score(y_test, lineerPredict)  # First R-Squared

model = sm.OLS(lineerPredict, y_test)
print(model.fit().summary()) # Second R-Squared

The first R-squared result is -4.28.
The second R-squared result is 0.84.

But I didn't understand which value is correct.

Answer

Arguably, the real challenge in such cases is to be sure that you compare apples to apples. And in your case, it seems that you don't. Our best friend is always the relevant documentation, combined with simple experiments. So...

Although scikit-learn's LinearRegression() (i.e. your 1st R-squared) is fitted by default with fit_intercept=True (docs), this is not the case with statsmodels' OLS (your 2nd R-squared); quoting from the docs:

An intercept is not included by default and should be added by the user. See statsmodels.tools.add_constant.
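
For illustration, add_constant does nothing more than prepend a column of ones to the design matrix; that column of ones is what plays the role of the intercept term:

import numpy as np
import statsmodels.api as sm

sm.add_constant(np.arange(1, 4).reshape(-1, 1))
# array([[1., 1.],
#        [1., 2.],
#        [1., 3.]])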

Keeping this important detail in mind, let's run some simple experiments with dummy data:

import numpy as np
import statsmodels.api as sm
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression

# dummy data:
y = np.array([1,3,4,5,2,3,4])
X = np.array(range(1,8)).reshape(-1,1) # reshape to column

# scikit-learn:
lr = LinearRegression()
lr.fit(X,y)
# LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
#     normalize=False)

lr.score(X,y)
# 0.16118421052631582

y_pred=lr.predict(X)
r2_score(y, y_pred)
# 0.16118421052631582


# statsmodels
# first artificially add intercept to X, as advised in the docs:
X_ = sm.add_constant(X)

model = sm.OLS(y,X_) # X_ here
results = model.fit()
results.rsquared
# 0.16118421052631593

For all practical purposes, these two values of R-squared produced by scikit-learn and statsmodels are identical.

Let's go a step further, and try a scikit-learn model without intercept, but where we use the artificially "intercepted" data X_ we have already built for use with statsmodels:

lr2 = LinearRegression(fit_intercept=False)
lr2.fit(X_,y) # X_ here
# LinearRegression(copy_X=True, fit_intercept=False, n_jobs=None,
#         normalize=False)

lr2.score(X_, y)
# 0.16118421052631593

y_pred2 = lr2.predict(X_)
r2_score(y, y_pred2)
# 0.16118421052631593

Again, the R-squared is identical with the previous values.

So, what happens when we "accidentally" forget to account for the fact that statsmodels OLS is fitted without an intercept? Let's see:

model3 = sm.OLS(y,X) # X here, i.e. no intercept
results3 = model3.fit()
results3.rsquared
# 0.8058035714285714

Well, an R-squared of 0.80 is indeed very far from the 0.16 returned by the model with an intercept, and arguably this is exactly what happened in your case.

So far so good, and I could easily finish the answer here; but there is indeed a point where this harmonious world breaks down: let's see what happens when we fit both models without an intercept, using the initial data X to which we have not artificially added an intercept column. We have already fitted the OLS model above and got an R-squared of 0.80; what about a similar model from scikit-learn?

# scikit-learn
lr3 = LinearRegression(fit_intercept=False)
lr3.fit(X,y) # X here
lr3.score(X,y)
# -0.4309210526315792

y_pred3 = lr3.predict(X)
r2_score(y, y_pred3)
# -0.4309210526315792

Ooops...! What the heck??

It seems that scikit-learn, when computing the r2_score, always assumes an intercept, either explicitly in the model (fit_intercept=True) or implicitly in the data (the way we produced X_ from X above, using statsmodels' add_constant); digging a little online reveals a Github thread (closed without a remedy) where it is confirmed that this is indeed the case.
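
The mechanics behind the difference can be reproduced by hand: r2_score always measures the residuals against the centered total sum of squares (deviations from the mean of y), while statsmodels, as its rsquared documentation notes, switches to the uncentered total sum of squares when the model contains no constant. A minimal numpy sketch with the dummy data above:

beta = (X.ravel() @ y) / (X.ravel() @ X.ravel())  # slope of the no-intercept fit
ss_res = ((y - beta * X.ravel()) ** 2).sum()      # residual sum of squares

1 - ss_res / ((y - y.mean()) ** 2).sum()  # centered TSS
# -0.4309210526315792  (what r2_score / lr3.score report)

1 - ss_res / (y ** 2).sum()               # uncentered TSS
# 0.8058035714285714  (what results3.rsquared reports)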

[UPDATE Dec 2021: for a more detailed & in-depth investigation and explanation of why the two scores are different in this particular case (i.e. both models fitted without an intercept), see this great answer by Flavia]

Let me clarify that the discrepancy I have described above has nothing to do with your issue: in your case, the real issue is that you are actually comparing apples (a model with intercept) with oranges (a model without intercept).
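
Back to the question itself, here is a minimal sketch of an apples-to-apples version of the original comparison (assuming the x_train, x_test, y_train, y_test arrays from the question): fit the statsmodels model on the training data with an explicitly added constant, then score both models identically on the test set. Note also that the original snippet regressed the predictions on y_test, which gives the in-sample R-squared of that auxiliary regression rather than a measure of out-of-sample fit:

ols = sm.OLS(y_train, sm.add_constant(x_train)).fit()  # intercept added explicitly
olsPredict = ols.predict(sm.add_constant(x_test))

r2_score(y_test, olsPredict)  # now directly comparable to scoreLineer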

So, why does scikit-learn not only fail in such an (admittedly edge) case, but why is the fact treated with indifference even when it emerges in a Github issue? (Notice also that the scikit-learn core developer who replies in the above thread casually admits that "I'm not super familiar with stats"...).

The answer goes a little beyond coding issues, such as the ones SO is mainly about, but it may be worth elaborating a little here.

Arguably, the reason is that the whole R-squared concept comes in fact directly from the world of statistics, where the emphasis is on interpretative models, and it has little use in machine learning contexts, where the emphasis is clearly on predictive models; at least AFAIK, and beyond some very introductory courses, I have never (I mean never...) seen a predictive modeling problem where the R-squared is used for any kind of performance assessment; nor is it an accident that popular machine learning introductions, such as Andrew Ng's Machine Learning at Coursera, do not even bother to mention it. And, as noted in the Github thread above (emphasis added):

In particular when using a test set, it's a bit unclear to me what the R^2 means.

I certainly agree.

As for the edge case discussed above (whether or not to include an intercept term), I suspect it would sound really irrelevant to modern deep learning practitioners, where the equivalent of an intercept (the bias parameters) is always included by default in neural network models...

See the accepted (and highly upvoted) answer in the Cross Validated question Difference between statsmodel OLS and scikit linear regression for a more detailed discussion along these last lines. The discussion (and links) in Is R-squared Useless?, triggered by some relevant (negative) remarks by the great statistician Cosma Shalizi, is also enlightening and highly recommended.
