OLS Regression: Scikit vs. Statsmodels?


Question



Short version: I was using scikit's LinearRegression on some data, but I'm used to p-values, so I put the data into statsmodels OLS. Although the R^2 is about the same, the variable coefficients all differ by large amounts. This concerns me, since the most likely explanation is that I've made an error somewhere, and now I don't feel confident in either output (I have probably set up one of the models incorrectly, but I don't know which one).

Longer version: Because I don't know where the issue is, I don't know exactly which details to include, and including everything is probably too much. I am also not sure about including code or data.

I am under the impression that scikit's LR and statsmodels OLS should both be doing OLS, and as far as I know OLS is OLS so the results should be the same.

For scikit's LR, the results are (statistically) the same whether or not I set normalize=True or =False, which I find somewhat strange.

For statsmodels OLS, I normalize the data using StandardScaler from sklearn. I add a column of ones so it includes an intercept (since scikit's output includes an intercept). More on that here: http://statsmodels.sourceforge.net/devel/examples/generated/example_ols.html (Adding this column did not change the variable coefficients to any notable degree and the intercept was very close to zero.) StandardScaler didn't like that my ints weren't floats, so I tried this: https://github.com/scikit-learn/scikit-learn/issues/1709 That makes the warning go away but the results are exactly the same.
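(For context, this kind of coefficient gap is exactly what you'd see if one model was fit on standardized data and the other on raw data: the slopes differ by each feature's standard deviation even though fit quality is identical. A minimal sketch with synthetic data, not the actual WoW set:)

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) * [1.0, 10.0]   # second feature has 10x the spread
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200)

raw = LinearRegression().fit(X, y)
X_std = StandardScaler().fit_transform(X)
scaled = LinearRegression().fit(X_std, y)

# Slopes on standardized data are the raw slopes times each column's std...
print(raw.coef_, scaled.coef_)
# ...but the fitted values (and hence R^2) are identical:
print(np.allclose(raw.predict(X), scaled.predict(X_std)))
```

So identical R^2 with very different coefficients is consistent with a scaling mismatch between the two pipelines rather than a bug in either library.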

Granted, I'm using 5-fold CV for the sklearn approach (R^2 is consistent for both test and training data each time), while for statsmodels I just give it all the data.

R^2 is about 0.41 for both sklearn and statsmodels (this is good for social science). This could be a good sign or just a coincidence.

The data is observations of avatars in WoW (from http://mmnet.iis.sinica.edu.tw/dl/wowah/) which I munged about to make it weekly with some different features. Originally this was a class project for a data science class.

Independent variables include number of observations in a week (int), character level (int), if in a guild (Boolean), when seen (Booleans on weekday day, weekday eve, weekday late, and the same three for weekend), a dummy for character class (at the time for the data collection, there were only 8 classes in WoW, so there are 7 dummy vars and the original string categorical variable is dropped), and others.
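(For reference, the k-1 dummy encoding described above, i.e. 8 classes giving 7 dummy variables with the original categorical column dropped, is what pandas produces with `drop_first=True`. A small sketch with made-up class names:)

```python
import pandas as pd

# Hypothetical labels standing in for the 8 WoW character classes
df = pd.DataFrame({"char_class": ["Mage", "Warrior", "Priest", "Mage"]})

# drop_first=True drops the first (reference) category, leaving k-1 columns
dummies = pd.get_dummies(df["char_class"], prefix="class", drop_first=True)
print(dummies.columns.tolist())   # 'Mage' becomes the reference level
```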

The dependent variable is how many levels each character gained during that week (int).

Interestingly, some of the relative ordering within groups of similar variables is maintained across statsmodels and sklearn. So the rank order of the "when seen" variables is the same although the loadings are very different, and the rank order of the character-class dummies is the same although, again, the loadings are very different.

I think this question is similar to this one: Difference in Python statsmodels OLS and R's lm

I am good enough at Python and stats to make a go of it, but then not good enough to figure something like this out. I tried reading the sklearn docs and the statsmodels docs, but if the answer was there staring me in the face I did not understand it.

I would love to know:

  1. Which output might be accurate? (Granted they might both be if I missed a kwarg.)
  2. If I made a mistake, what is it and how to fix it?
  3. Could I have figured this out without asking here, and if so how?

I know this question has some rather vague bits (no code, no data, no output), but I am thinking it is more about the general processes of the two packages. Sure, one seems to be more stats and one seems to be more machine learning, but they're both OLS so I don't understand why the outputs aren't the same.

(I even tried some other OLS calls to triangulate, one gave a much lower R^2, one looped for five minutes and I killed it, and one crashed.)

Thanks!

Solution

It sounds like you are not feeding the same matrix of regressors X to both procedures (but see below). Here's an example to show you which options you need to use for sklearn and statsmodels to produce identical results.

import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

# Generate artificial data (2 regressors + constant)
nobs = 100 
X = np.random.random((nobs, 2)) 
X = sm.add_constant(X)
beta = [1, .1, .5] 
e = np.random.random(nobs)
y = np.dot(X, beta) + e 

# Fit regression model
sm.OLS(y, X).fit().params
>> array([ 1.4507724 ,  0.08612654,  0.60129898])

LinearRegression(fit_intercept=False).fit(X, y).coef_
>> array([ 1.4507724 ,  0.08612654,  0.60129898])

As a commenter suggested, even if you are giving both programs the same X, it may not have full column rank, and sm/sklearn could be taking (different) actions under the hood to make the OLS computation go through (i.e. dropping different columns).
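A quick way to test this hypothesis before fitting anything is to compare the rank of X to its number of columns (synthetic data here, just to show the check):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(100, 3))
# Duplicate the first column to create an exactly collinear regressor
X = np.column_stack([A, A[:, 0]])

rank = np.linalg.matrix_rank(X)
print(rank, "of", X.shape[1], "columns are independent")
```

If the rank is less than the number of columns, the individual coefficients are not uniquely determined, and the two libraries may resolve the redundancy differently.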

I recommend you use pandas and patsy to take care of this:

import pandas as pd
from patsy import dmatrices

dat = pd.read_csv('wow.csv')
y, X = dmatrices('levels ~ week + character + guild', data=dat)

Or, alternatively, the statsmodels formula interface:

import statsmodels.formula.api as smf
dat = pd.read_csv('wow.csv')
mod = smf.ols('levels ~ week + character + guild', data=dat).fit()
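Since p-values were the original motivation for switching to statsmodels, note that the fitted results object exposes them directly. A self-contained sketch with synthetic data (the real column names would follow the formula above):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the real data
rng = np.random.default_rng(2)
dat = pd.DataFrame({"week": rng.normal(size=50)})
dat["levels"] = 2.0 * dat["week"] + rng.normal(size=50)

mod = smf.ols("levels ~ week", data=dat).fit()
print(mod.params)     # intercept and slope
print(mod.pvalues)    # per-term p-values, which LinearRegression does not report
print(mod.rsquared)   # R^2
```

`mod.summary()` prints the full regression table in one go.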

Edit: This example might be useful: http://statsmodels.sourceforge.net/devel/example_formulas.html
