How does the class_weight parameter in scikit-learn work?


Problem description


I am having a lot of trouble understanding how the class_weight parameter in scikit-learn's Logistic Regression operates.

The situation


I want to use logistic regression to do binary classification on a very unbalanced data set. The classes are labelled 0 (negative) and 1 (positive) and the observed data is in a ratio of about 19:1 with the majority of samples having negative outcome.
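A minimal sketch of this setup, assuming a synthetic stand-in for the real data (the question does not include the dataset itself), with roughly a 19:1 negative:positive ratio and a stratified 80/20 split:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real data: about 19:1 negative:positive
X, y = make_classification(n_samples=20000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)

# Disjoint train/test split (about 80/20); stratify so the test set
# keeps the observed ~19:1 proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
print(np.bincount(y_train), np.bincount(y_test))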

First attempt: preparing the training data by hand


I split the data I had into disjoint sets for training and testing (about 80/20). Then I randomly sampled the training data by hand to get training data in different proportions than 19:1; from 2:1 -> 16:1.


I then trained logistic regression on these different training data subsets and plotted recall (= TP/(TP+FN)) as a function of the different training proportions. Of course, the recall was computed on the disjoint TEST samples which had the observed proportions of 19:1. Note, although I trained the different models on different training data, I computed recall for all of them on the same (disjoint) test data.
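Continuing the synthetic sketch above, the manual procedure might look like this: undersample the negative class to a chosen neg:pos ratio, train on that subset, and always score recall on the same untouched test set (the ratios in the loop are illustrative):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.RandomState(0)
pos_idx = np.where(y_train == 1)[0]
neg_idx = np.where(y_train == 0)[0]

for ratio in [2, 4, 8, 16]:  # neg:pos ratios to try
    neg_sample = rng.choice(neg_idx, size=ratio * len(pos_idx), replace=False)
    subset = np.concatenate([pos_idx, neg_sample])
    clf = LogisticRegression(max_iter=1000).fit(X_train[subset], y_train[subset])
    # recall is always computed on the full, untouched test set
    print(ratio, recall_score(y_test, clf.predict(X_test)))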


The results were as expected: the recall was about 60% at 2:1 training proportions and fell off rather fast by the time it got to 16:1. There were several proportions 2:1 -> 6:1 where the recall was decently above 5%.

Second attempt: grid search


Next, I wanted to test different regularization parameters and so I used GridSearchCV and made a grid of several values of the C parameter as well as the class_weight parameter. To translate my n:m proportions of negative:positive training samples into the dictionary language of class_weight I thought I could just specify several dictionaries as follows:

{ 0:0.67, 1:0.33 } #expected 2:1
{ 0:0.75, 1:0.25 } #expected 3:1
{ 0:0.8, 1:0.2 }   #expected 4:1


and I also included None and auto.
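A sketch of that grid, assuming a current scikit-learn version (where the old "auto" string the question used is spelled "balanced"); the C values are illustrative:

from sklearn.model_selection import GridSearchCV

param_grid = {
    "C": [0.01, 0.1, 1, 10],                 # illustrative values
    "class_weight": [{0: 0.67, 1: 0.33},     # expected 2:1
                     {0: 0.75, 1: 0.25},     # expected 3:1
                     {0: 0.8, 1: 0.2},       # expected 4:1
                     None,
                     "balanced"],            # "auto" in old versions
}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                    scoring="recall", cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)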


This time the results were totally wacked. All my recalls came out tiny (< 0.05) for every value of class_weight except auto. So I can only assume that my understanding of how to set the class_weight dictionary is wrong. Interestingly, the class_weight value of 'auto' in the grid search was around 59% for all values of C, and I guessed it balances to 1:1?

My questions


  1. How do you properly use class_weight to achieve different balances in training data from what you actually give it? Specifically, what dictionary do I pass to class_weight to use n:m proportions of negative:positive training samples?


  2. If you pass various class_weight dictionaries to GridSearchCV, during cross-validation will it rebalance the training fold data according to the dictionary but use the true given sample proportions for computing my scoring function on the test fold? This is critical since any metric is only useful to me if it comes from data in the observed proportions.


  3. What does the auto value of class_weight do as far as proportions? I read the documentation and I assume "balances the data inversely proportional to their frequency" just means it makes it 1:1. Is this correct? If not, can someone clarify?

Recommended answer


First off, it might not be good to just go by recall alone. You can simply achieve a recall of 100% by classifying everything as the positive class. I usually suggest using AUC for selecting parameters, and then finding a threshold for the operating point (say a given precision level) that you are interested in.
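As an illustration of that suggestion, continuing the sketches above (the 0.80 precision target is an arbitrary example, not a recommendation):

from sklearn.metrics import roc_auc_score, precision_recall_curve

probs = grid.best_estimator_.predict_proba(X_test)[:, 1]
print("test AUC:", roc_auc_score(y_test, probs))

# then pick a decision threshold for the operating point of interest,
# e.g. the first threshold that reaches ~80% precision
precision, recall, thresholds = precision_recall_curve(y_test, probs)
i = np.argmax(precision[:-1] >= 0.80)  # precision[:-1] aligns with thresholds
print("threshold:", thresholds[i],
      "precision:", precision[i], "recall:", recall[i])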


For how class_weight works: It penalizes mistakes in samples of class[i] with class_weight[i] instead of 1. So a higher class_weight means you want to put more emphasis on that class. From what you say it seems class 0 is 19 times more frequent than class 1. So you should increase the class_weight of class 1 relative to class 0, say {0: .1, 1: .9}. If the class_weight doesn't sum to 1, it will basically change the regularization parameter.
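For example (the weights here are hypothetical; whether a normalized pair like 0.1/0.9 or an unnormalized one like 1/19 works better is something to tune):

from sklearn.linear_model import LogisticRegression

# errors on the rare positive class now cost 9x more than errors on class 0
clf_weighted = LogisticRegression(class_weight={0: 0.1, 1: 0.9}, max_iter=1000)
clf_weighted.fit(X_train, y_train)

# an unnormalized dict such as {0: 1, 1: 19} also up-weights class 1, but
# because the weights no longer sum to 1 it effectively rescales the
# strength of the C regularization term as well
clf_ratio = LogisticRegression(class_weight={0: 1, 1: 19}, max_iter=1000)
clf_ratio.fit(X_train, y_train)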


For how class_weight="auto" works, you can have a look at this discussion. In the dev version you can use class_weight="balanced", which is easier to understand: it basically means replicating the smaller class until you have as many samples as in the larger one, but in an implicit way.
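In current scikit-learn, "balanced" is a standard option, and you can inspect the weights it would assign with compute_class_weight; the formula is n_samples / (n_classes * np.bincount(y)):

from sklearn.utils.class_weight import compute_class_weight

# weights inversely proportional to class frequency:
# n_samples / (n_classes * np.bincount(y))
w = compute_class_weight(class_weight="balanced",
                         classes=np.array([0, 1]), y=y_train)
print(dict(zip([0, 1], w)))
# with ~19:1 data this is roughly {0: 0.53, 1: 10.0}: each positive sample
# counts as much as ~19 negatives, i.e. the implicit replication described above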

