sklearn LogisticRegression and changing the default threshold for classification

Question

I am using LogisticRegression from the sklearn package, and have a quick question about classification. I built a ROC curve for my classifier, and it turns out that the optimal threshold for my training data is around 0.25. I'm assuming that the default threshold when creating predictions is 0.5. How can I change this default setting to find out what the accuracy is in my model when doing a 10-fold cross-validation? Basically, I want my model to predict a '1' for anyone greater than 0.25, not 0.5. I've been looking through all the documentation, and I can't seem to get anywhere.

Thanks in advance for your help.

Recommended answer

That is not a built-in feature. You can "add" it by wrapping the LogisticRegression class in your own class, and adding a threshold attribute which you use inside a custom predict() method.
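
As a rough illustration of the wrapping idea, here is a minimal sketch for the binary case. The class name ThresholdClassifier, its threshold parameter, and the synthetic data are assumptions made up for this example; they are not part of scikit-learn or of the original answer. The wrapper delegates fitting to the inner estimator and only changes how predict() turns probabilities into labels, so it can be passed to cross_val_score for the 10-fold accuracy the question asks about.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


class ThresholdClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical wrapper: binary classifier with a custom probability cutoff."""

    def __init__(self, estimator=None, threshold=0.5):
        self.estimator = estimator
        self.threshold = threshold

    def fit(self, X, y):
        # Fit a fresh copy so the estimator passed in is left untouched.
        base = self.estimator if self.estimator is not None else LogisticRegression()
        self.estimator_ = clone(base).fit(X, y)
        self.classes_ = self.estimator_.classes_
        return self

    def predict_proba(self, X):
        return self.estimator_.predict_proba(X)

    def predict(self, X):
        # Column 1 holds the probability of the second ("positive") class.
        positive = self.predict_proba(X)[:, 1] >= self.threshold
        return self.classes_[positive.astype(int)]


# Synthetic, imbalanced data purely for illustration.
X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)

clf = ThresholdClassifier(LogisticRegression(max_iter=1000), threshold=0.25)
scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
print("mean 10-fold accuracy at threshold 0.25:", scores.mean())
```

Recent scikit-learn releases (1.5 and later, as far as I know) also ship FixedThresholdClassifier in sklearn.model_selection, which covers a similar use case; check the documentation for your installed version before relying on it.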

However, some caveats:


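  1. The default threshold is actually 0. LogisticRegression.decision_function() returns a signed score (w·x + b) that is proportional to the signed distance to the selected separating hyperplane, and predict() assigns the positive class when that score is above 0. If you are looking at predict_proba(), then you are looking at the logistic (sigmoid) function of that score, i.e. the inverse of logit(), with a threshold of 0.5. A probability threshold of 0.25 is therefore the same as a decision_function() threshold of logit(0.25) = ln(0.25 / 0.75) ≈ -1.10. But the probability is more expensive to compute.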
  2. By selecting the "optimal" threshold like this, you are utilizing information post-learning, which spoils your test set (i.e., your test or validation set no longer provides an unbiased estimate of out-of-sample error). You may therefore be inducing additional over-fitting unless you choose the threshold inside a cross-validation loop on your training set only, then use it and the trained classifier with your test set.
  3. Consider using class_weight if you have an unbalanced problem rather than manually setting the threshold. This should force the classifier to choose a hyperplane farther away from the class of serious interest.
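
The second caveat can be sketched as follows, under some assumptions of mine: the data is synthetic, the threshold is chosen by maximizing Youden's J (TPR minus FPR) on the ROC curve, and out-of-fold probabilities from cross_val_predict stand in for "a cross-validation loop on your training set only". The original answer does not prescribe any of these specifics. The point is only that the test split plays no part in choosing the threshold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_curve
from sklearn.model_selection import cross_val_predict, train_test_split

# Synthetic, imbalanced data purely for illustration.
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000)

# Out-of-fold probabilities: each training sample is scored by a model
# that did not see it during fitting.
oof_proba = cross_val_predict(
    model, X_train, y_train, cv=10, method="predict_proba"
)[:, 1]

# Choose the threshold on the training data only (Youden's J is one
# possible criterion). The first ROC point is a sentinel threshold above
# every score, so it is skipped.
fpr, tpr, thresholds = roc_curve(y_train, oof_proba)
best = np.argmax((tpr - fpr)[1:]) + 1
threshold = thresholds[best]

# Refit on the full training set, then apply the frozen threshold to the
# untouched test set.
model.fit(X_train, y_train)
test_pred = (model.predict_proba(X_test)[:, 1] >= threshold).astype(int)

print("threshold chosen on training folds:", threshold)
print("test accuracy at that threshold:", accuracy_score(y_test, test_pred))
```

If class imbalance is the underlying concern, constructing the model as LogisticRegression(class_weight="balanced") is the alternative the third caveat points to, and it can be compared against the thresholded model using the same split.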
