Different Linear Regression Coefficients with statsmodels and sklearn


Problem description


I was planning to use sklearn linear_model to plot a graph of the linear regression result, and statsmodels.api to get a detailed summary of the learning result. However, the two packages produce very different results on the same input.


For example, the constant term from sklearn is 7.8e-14, but the constant term from statsmodels is 48.6. (I added a column of 1's in x for the constant term when using both methods.) My code for both methods is succinct:

import statsmodels.api as sm
from sklearn import linear_model

# Use statsmodels linear regression to get a result (summary) for the model.
def reg_statsmodels(y, x):
    results = sm.OLS(y, x).fit()
    return results

# Use sklearn linear regression to compute the coefficients for the prediction.
def reg_sklearn(y, x):
    lr = linear_model.LinearRegression()
    lr.fit(x, y)
    return lr.coef_


The input is too complicated to post here. Could a singular input x have caused this problem?


By making a 3-d plot using PCA, it seems that the sklearn result is not a good approximation. What are some explanations? I still want to make a visualization, so it would be very helpful to fix the issue in my use of the sklearn linear regression implementation.

Answer

You say that

I added a column of 1's in x for constant term when using both methods


But the documentation of LinearRegression says that

LinearRegression(fit_intercept=True, [...])


it fits an intercept by default. This could explain the difference you see in the constant term.


Now for the other coefficients, differences can occur when two of the variables are highly correlated. Consider the most extreme case, where two of your columns are identical: reducing the coefficient in front of either one can be compensated by increasing the other, so the least-squares solution is no longer unique. This is the first thing I'd check.
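That check can be done directly on the design matrix. The sketch below uses a fabricated matrix (not the asker's data) with a deliberately duplicated column; a column rank below the number of columns, or an enormous condition number, signals the singular/collinear situation described above:

```python
import numpy as np

# Hypothetical design matrix: an intercept column plus one predictor.
t = np.linspace(0.0, 1.0, 50)
X = np.column_stack([np.ones(50), t])

# Duplicate the predictor column -- the most extreme case of collinearity.
X_dup = np.column_stack([X, t])

# Rank deficiency means the least-squares solution is not unique,
# so different solvers can legitimately return different coefficients.
print(np.linalg.matrix_rank(X))      # 2: full column rank
print(np.linalg.matrix_rank(X_dup))  # 2, but the matrix has 3 columns: rank deficient
print(np.linalg.cond(X_dup))         # enormous condition number
```

If your x fails this check, drop or combine the collinear columns (or use a regularized model) before comparing the two packages' coefficients.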

