How does the class_weight parameter in scikit-learn work?


Problem description


I am having a lot of trouble understanding how the class_weight parameter in scikit-learn's Logistic Regression operates.

The situation


I want to use logistic regression to do binary classification on a very unbalanced data set. The classes are labelled 0 (negative) and 1 (positive) and the observed data is in a ratio of about 19:1 with the majority of samples having negative outcome.
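A minimal sketch of this setup, assuming a synthetic stand-in for the real data (the question does not include the dataset itself), with roughly a 19:1 negative:positive ratio and a stratified 80/20 split:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real data: about 19:1 negative:positive
X, y = make_classification(n_samples=20000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)

# Disjoint train/test split (about 80/20); stratify so the test set
# keeps the observed ~19:1 proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
print(np.bincount(y_train), np.bincount(y_test))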

First attempt: preparing the training data by hand


I split the data I had into disjoint sets for training and testing (about 80/20). Then I randomly sampled the training data by hand to get training data in different proportions than 19:1; from 2:1 -> 16:1.


I then trained logistic regression on these different training data subsets and plotted recall (= TP/(TP+FN)) as a function of the different training proportions. Of course, the recall was computed on the disjoint TEST samples which had the observed proportions of 19:1. Note, although I trained the different models on different training data, I computed recall for all of them on the same (disjoint) test data.
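Continuing the synthetic sketch above, the manual procedure might look like this: undersample the negative class to a chosen neg:pos ratio, train on that subset, and always score recall on the same untouched test set (the ratios in the loop are illustrative):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.RandomState(0)
pos_idx = np.where(y_train == 1)[0]
neg_idx = np.where(y_train == 0)[0]

for ratio in [2, 4, 8, 16]:  # neg:pos ratios to try
    neg_sample = rng.choice(neg_idx, size=ratio * len(pos_idx), replace=False)
    subset = np.concatenate([pos_idx, neg_sample])
    clf = LogisticRegression(max_iter=1000).fit(X_train[subset], y_train[subset])
    # recall is always computed on the full, untouched test set
    print(ratio, recall_score(y_test, clf.predict(X_test)))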


The results were as expected: the recall was about 60% at 2:1 training proportions and fell off rather fast by the time it got to 16:1. There were several proportions 2:1 -> 6:1 where the recall was decently above 5%.

Second attempt: grid search


Next, I wanted to test different regularization parameters and so I used GridSearchCV and made a grid of several values of the C parameter as well as the class_weight parameter. To translate my n:m proportions of negative:positive training samples into the dictionary language of class_weight I thought I could just specify several dictionaries as follows:

{ 0:0.67, 1:0.33 } #expected 2:1
{ 0:0.75, 1:0.25 } #expected 3:1
{ 0:0.8, 1:0.2 }   #expected 4:1


and I also included None and auto.
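A sketch of that grid, assuming a current scikit-learn version (where the old "auto" string the question used is spelled "balanced"); the C values are illustrative:

from sklearn.model_selection import GridSearchCV

param_grid = {
    "C": [0.01, 0.1, 1, 10],                 # illustrative values
    "class_weight": [{0: 0.67, 1: 0.33},     # expected 2:1
                     {0: 0.75, 1: 0.25},     # expected 3:1
                     {0: 0.8, 1: 0.2},       # expected 4:1
                     None,
                     "balanced"],            # "auto" in old versions
}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                    scoring="recall", cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)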


This time the results were totally wacked. All my recalls came out tiny (< 0.05) for every value of class_weight except auto. So I can only assume that my understanding of how to set the class_weight dictionary is wrong. Interestingly, the class_weight value of 'auto' in the grid search was around 59% for all values of C, and I guessed it balances to 1:1?

My questions


  1. How do you properly use class_weight to achieve different balances in training data from what you actually give it? Specifically, what dictionary do I pass to class_weight to use n:m proportions of negative:positive training samples?


  2. If you pass various class_weight dictionaries to GridSearchCV, during cross-validation will it rebalance the training fold data according to the dictionary but use the true given sample proportions for computing my scoring function on the test fold? This is critical since any metric is only useful to me if it comes from data in the observed proportions.


  3. What does the auto value of class_weight do as far as proportions? I read the documentation and I assume "balances the data inversely proportional to their frequency" just means it makes it 1:1. Is this correct? If not, can someone clarify?

Recommended answer


First off, it might not be good to just go by recall alone. You can simply achieve a recall of 100% by classifying everything as the positive class. I usually suggest using AUC for selecting parameters, and then finding a threshold for the operating point (say a given precision level) that you are interested in.
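As an illustration of that suggestion, continuing the sketches above (the 0.80 precision target is an arbitrary example, not a recommendation):

from sklearn.metrics import roc_auc_score, precision_recall_curve

probs = grid.best_estimator_.predict_proba(X_test)[:, 1]
print("test AUC:", roc_auc_score(y_test, probs))

# then pick a decision threshold for the operating point of interest,
# e.g. the first threshold that reaches ~80% precision
precision, recall, thresholds = precision_recall_curve(y_test, probs)
i = np.argmax(precision[:-1] >= 0.80)  # precision[:-1] aligns with thresholds
print("threshold:", thresholds[i],
      "precision:", precision[i], "recall:", recall[i])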


For how class_weight works: It penalizes mistakes in samples of class[i] with class_weight[i] instead of 1. So a higher class_weight means you want to put more emphasis on that class. From what you say it seems class 0 is 19 times more frequent than class 1. So you should increase the class_weight of class 1 relative to class 0, say {0: .1, 1: .9}. If the class_weight doesn't sum to 1, it will basically change the regularization parameter.
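For example (the weights here are hypothetical; whether a normalized pair like 0.1/0.9 or an unnormalized one like 1/19 works better is something to tune):

from sklearn.linear_model import LogisticRegression

# errors on the rare positive class now cost 9x more than errors on class 0
clf_weighted = LogisticRegression(class_weight={0: 0.1, 1: 0.9}, max_iter=1000)
clf_weighted.fit(X_train, y_train)

# an unnormalized dict such as {0: 1, 1: 19} also up-weights class 1, but
# because the weights no longer sum to 1 it effectively rescales the
# strength of the C regularization term as well
clf_ratio = LogisticRegression(class_weight={0: 1, 1: 19}, max_iter=1000)
clf_ratio.fit(X_train, y_train)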


For how class_weight="auto" works, you can have a look at this discussion. In the dev version you can use class_weight="balanced", which is easier to understand: it basically means replicating the smaller class until you have as many samples as in the larger one, but in an implicit way.
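In current scikit-learn, "balanced" is a standard option, and you can inspect the weights it would assign with compute_class_weight; the formula is n_samples / (n_classes * np.bincount(y)):

from sklearn.utils.class_weight import compute_class_weight

# weights inversely proportional to class frequency:
# n_samples / (n_classes * np.bincount(y))
w = compute_class_weight(class_weight="balanced",
                         classes=np.array([0, 1]), y=y_train)
print(dict(zip([0, 1], w)))
# with ~19:1 data this is roughly {0: 0.53, 1: 10.0}: each positive sample
# counts as much as ~19 negatives, i.e. the implicit replication described above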

