sklearn LogisticRegression and changing the default threshold for classification


Question


I am using LogisticRegression from the sklearn package, and have a quick question about classification. I built a ROC curve for my classifier, and it turns out that the optimal threshold for my training data is around 0.25. I'm assuming that the default threshold when creating predictions is 0.5. How can I change this default setting to find out what the accuracy is in my model when doing a 10-fold cross-validation? Basically, I want my model to predict a '1' for anyone greater than 0.25, not 0.5. I've been looking through all the documentation, and I can't seem to get anywhere.

Answer


That is not a built-in feature. You can "add" it by wrapping the LogisticRegression class in your own class, and adding a threshold attribute which you use inside a custom predict() method.
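A minimal sketch of that wrapper. The class name `ThresholdedClassifier` is illustrative, not part of sklearn; it simply delegates to a fitted estimator and applies a custom cutoff to `predict_proba`:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

class ThresholdedClassifier:
    """Wraps a probabilistic classifier and applies a custom
    decision threshold instead of the default 0.5."""

    def __init__(self, clf, threshold=0.5):
        self.clf = clf
        self.threshold = threshold

    def fit(self, X, y):
        self.clf.fit(X, y)
        return self

    def predict(self, X):
        # Predict '1' for anyone whose positive-class probability
        # exceeds the custom threshold (e.g. 0.25 instead of 0.5).
        return (self.clf.predict_proba(X)[:, 1] >= self.threshold).astype(int)

# Synthetic data just to show the wrapper in use
X, y = make_classification(n_samples=200, random_state=0)
model = ThresholdedClassifier(LogisticRegression(max_iter=1000), threshold=0.25)
model.fit(X, y)
preds = model.predict(X)
```

Lowering the threshold can only add positive predictions, never remove them, so a 0.25 cutoff labels at least as many samples '1' as the default 0.5 would.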

Some caveats, however:

  1. The default threshold is actually 0. LogisticRegression.decision_function() returns a signed distance to the selected separating hyperplane. If you are looking at predict_proba(), then you are looking at the logistic function (sigmoid) of the hyperplane distance, thresholded at 0.5. But that is more expensive to compute.
  2. By selecting the "optimal" threshold like this, you are utilizing information post-learning, which spoils your test set (i.e., your test or validation set no longer provides an unbiased estimate of out-of-sample error). You may therefore be inducing additional over-fitting unless you choose the threshold inside a cross-validation loop on your training set only, then use it and the trained classifier with your test set.
  3. Consider using class_weight if you have an unbalanced problem rather than manually setting the threshold. This should force the classifier to choose a hyperplane farther away from the class of serious interest.
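Points 1 and 3 can be checked on synthetic data (this example is mine, not the original answerer's): thresholding `decision_function()` at 0 gives the same labels as thresholding `predict_proba()` at 0.5, and `class_weight` is the built-in knob for imbalanced classes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# An imbalanced binary problem: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Point 1: decision_function > 0 and predict_proba > 0.5 are the same rule,
# since the sigmoid maps a signed distance of 0 to a probability of 0.5.
agree = (clf.decision_function(X) > 0) == (clf.predict_proba(X)[:, 1] > 0.5)

# Point 3: rather than hand-tuning a threshold, reweight the classes so the
# fitted hyperplane itself shifts away from the minority class.
weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
```

Here `class_weight="balanced"` reweights each class inversely to its frequency during fitting, which changes the learned hyperplane instead of post-hoc moving the cutoff.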

