Comparison of R and scikit-learn for a classification task with logistic regression


Problem description


I am fitting a logistic regression as described in the book 'An Introduction to Statistical Learning with Applications in R' by James, Witten, Hastie, and Tibshirani (2013).

More specifically, I am fitting a binary classification model to the 'Wage' dataset from the R package 'ISLR', as described in §7.8.1.

The predictor 'age' (expanded to a degree-4 polynomial) is fitted against the binary response wage > 250, and the predicted probabilities of the 'True' class are then plotted against age.

The model in R is fit as follows:

library(ISLR)   # provides the Wage data set
fit = glm(I(wage > 250) ~ poly(age, 4), data = Wage, family = binomial)

agelims = range(Wage$age)
age.grid = seq(from = agelims[1], to = agelims[2])
preds = predict(fit, newdata = list(age = age.grid), se = TRUE)
pfit = exp(preds$fit) / (1 + exp(preds$fit))   # inverse logit: fitted probabilities

Complete code (author's site): http://www-bcf.usc.edu/~gareth/ISL/Chapter%207%20Lab.txt
The corresponding plot from the book: http://www-bcf.usc.edu/~gareth/ISL/Chapter7/7.1.pdf (right)

I tried to fit a model to the same data in scikit-learn:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression

poly = PolynomialFeatures(4)
X = poly.fit_transform(df.age.values.reshape(-1, 1))
y = (df.wage > 250).astype(int).values   # replaces the deprecated .as_matrix()
clf = LogisticRegression()
clf.fit(X, y)

# transform (not fit_transform) so the grid uses the already-fitted expansion
X_test = poly.transform(np.arange(df.age.min(), df.age.max()).reshape(-1, 1))
prob = clf.predict_proba(X_test)

I then plotted the probabilities of the 'True' class against the age range, but the resulting plot looks quite different. (I am not talking about the CI bands or the rug plot, just the probability curve.) Am I missing something here?

Solution

After some more reading I understand that scikit-learn implements a regularized logistic regression model (an L2 penalty by default), whereas glm in R is unregularized. Statsmodels' GLM implementation (Python) is unregularized and gives results identical to R's.

http://statsmodels.sourceforge.net/stable/generated/statsmodels.genmod.generalized_linear_model.GLM.html#statsmodels.genmod.generalized_linear_model.GLM

The R package LiblineaR is similar to scikit-learn's logistic regression (when using the 'liblinear' solver).
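Conversely, scikit-learn's regularization can be made negligible by setting the inverse regularization strength C very large, which approximates R's unpenalized glm. A minimal sketch, again on synthetic stand-in data (an assumption; substitute the real Wage columns):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the ISLR 'Wage' data (assumption, not the real set).
rng = np.random.default_rng(0)
age = rng.integers(18, 80, size=3000).astype(float)
wage = 60 + 2.0 * age + rng.normal(0, 40, size=3000)
y = (wage > 250).astype(int)

# Scale before the polynomial expansion to keep the features well conditioned;
# this changes the basis but not the fitted probabilities.
z = ((age - age.mean()) / age.std()).reshape(-1, 1)
X = PolynomialFeatures(4, include_bias=False).fit_transform(z)

# A huge C makes the L2 penalty negligible, approximating unpenalized glm.
clf = LogisticRegression(C=1e10, max_iter=10000)
clf.fit(X, y)
prob = clf.predict_proba(X)[:, 1]  # P(wage > 250) for each observation
```

Note the include_bias=False: PolynomialFeatures adds a constant column by default, and since LogisticRegression fits its own (unpenalized) intercept, passing an extra bias column through a penalized model subtly shifts the fit as well.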

https://cran.r-project.org/web/packages/LiblineaR/
