Difference in SGD classifier results and statsmodels results for logistic with l1
Question
As a check on my work, I've been comparing the output of scikit-learn's SGDClassifier logistic implementation with statsmodels logistic regression. Once I add some l1 in combination with categorical variables, I get very different results. Is this a result of different solution techniques, or am I not using the correct parameters?
The differences are much bigger on my own dataset, but still pretty large using mtcars:
import numpy as np
import patsy
import statsmodels.api as sm
from sklearn.linear_model import SGDClassifier

df = sm.datasets.get_rdataset("mtcars", "datasets").data
y, X = patsy.dmatrices('am ~ standardize(wt) + standardize(disp) + C(cyl) - 1', df)
logit = sm.Logit(y, X).fit_regularized(alpha=.0035)

# Note: in recent scikit-learn versions, use loss='log_loss' and max_iter=1000
# instead of loss='log' and n_iter=1000.
clf = SGDClassifier(alpha=.0035, penalty='l1', loss='log', l1_ratio=1,
                    n_iter=1000, fit_intercept=False)
clf.fit(X, np.ravel(y))  # dmatrices returns a 2-D y; SGDClassifier expects 1-D
Gives:
sklearn: [-3.79663192 -1.16145654 0.95744308 -5.90284803 -0.67666106]
statsmodels: [-7.28440744 -2.53098894 3.33574042 -7.50604097 -3.15087396]
Answer
I've been working through some similar issues. I think the short answer might be that SGD doesn't work so well with only a few samples, but performs much better on larger data. I'd be interested in hearing from the sklearn devs. Compare, for example, using LogisticRegression:
# Note: recent scikit-learn versions require an l1-capable solver here,
# e.g. solver='liblinear' or solver='saga'.
clf2 = LogisticRegression(penalty='l1', C=1/.0035, fit_intercept=False)
clf2.fit(X, np.ravel(y))
which gives coefficients very similar to the l1-penalized Logit:
array([[-7.27275526, -2.52638167, 3.32801895, -7.50119041, -3.14198402]])
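Part of the remaining gap may also come from how the two estimators parameterize regularization. Per the scikit-learn documentation, SGDClassifier minimizes an averaged loss plus alpha times the penalty, (1/n)·Σ loss + α·R(w), while LogisticRegression minimizes C·Σ loss + R(w). Dividing the second objective by C·n shows the two agree when α = 1/(C·n), so the C matching the question's alpha depends on the sample size, not just 1/alpha. This is a sketch of the scaling arithmetic only, with n = 32 taken from the mtcars row count:

```python
# Hedged sketch: converting SGDClassifier's alpha to an equivalent
# LogisticRegression C, based on the two documented objectives:
#   SGDClassifier:       (1/n) * sum(loss) + alpha * R(w)
#   LogisticRegression:  C * sum(loss) + R(w)
# Dividing the second by C*n shows they coincide when alpha = 1/(C*n).
n_samples = 32           # mtcars has 32 rows
alpha = 0.0035           # value used in the question
C_equivalent = 1.0 / (alpha * n_samples)
print(C_equivalent)      # roughly 8.93, far from 1/alpha = ~285.7 used above
```

Under this reading, `C=1/.0035` applies much weaker regularization than `alpha=.0035` does in SGD, which is consistent with the two solvers landing on different coefficients.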