Use both sample_weight and class_weight simultaneously


Question

My dataset already has weighted examples, and in this binary classification problem I also have far more samples of the first class than of the second.

Can I use sample_weight and further re-weight it with class_weight in the model.fit() function?

Or do I first make a new array of new_weights and pass it to the fit function as sample_weight?

To further clarify: I already have individual weights for each sample in my dataset, and, to add to the complexity, the total sum of the sample weights of the first class is far greater than the total sample weights of the second class.

For example, I currently have:

y = [0, 0, 0, 0, 1, 1]
sample_weights = [0.01, 0.03, 0.05, 0.02, 0.01, 0.02]

so the sum of weights for class '0' is 0.11 and for class '1' is 0.03. So I should have:

class_weight = {0: 1., 1: 0.11/0.03}

I need to use both the sample_weight AND class_weight features. If one overrides the other, then I will have to create new sample_weights and then use fit() or train_on_batch().

So my question is: can I use both, or does one override the other?

Answer

You can surely do both if you want; the question is whether that is what you need. According to the Keras docs:

  • class_weight: Optional dictionary mapping class indices (integers) to a weight (float) value, used for weighting the loss function (during training only). This can be useful to tell the model to "pay more attention" to samples from an under-represented class.

  • sample_weight: Optional Numpy array of weights for the training samples, used for weighting the loss function (during training only). You can either pass a flat (1D) Numpy array with the same length as the input samples (1:1 mapping between weights and samples), or, in the case of temporal data [...].

So given that you mention that you "have far more of the first class compared to the second", I think you should go for the class_weight parameter. There you can indicate the ratio your dataset presents, so you can compensate for imbalanced data classes. sample_weight is more for when you want to define a weight or importance for each individual data element.

For example, if you pass:

class_weight = {0: 1., 1: 50.}

you will be saying that every sample from class 1 counts as 50 samples from class 0, therefore giving more "importance" to your elements from class 1 (as you surely have fewer of those samples). You can customize this to fit your own needs. There is more info on imbalanced datasets in this great question.

Note: To further compare both parameters, keep in mind that passing class_weight as {0:1., 1:50.} would be equivalent to passing sample_weight as [1.,1.,1.,...,50.,50.,...], given samples whose classes were [0,0,0,...,1,1,...].
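As a quick sanity check, this equivalence can be reproduced with plain NumPy (the six labels below are hypothetical, chosen to match the note above):

```python
import numpy as np

# Hypothetical labels: four samples of class 0, two of class 1
y_classes = np.array([0, 0, 0, 0, 1, 1])
class_weight = {0: 1., 1: 50.}

# Expanding the class_weight dict per sample yields the equivalent
# sample_weight array: [1., 1., 1., 1., 50., 50.]
equivalent_sample_weight = np.asarray([class_weight[cls] for cls in y_classes])
print(equivalent_sample_weight)
```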

As we can see, it is more practical to use class_weight in this case, and sample_weight could be of use in more specific cases where you actually want to give an "importance" to each sample individually. Using both can also be done if the case requires it, but one has to keep in mind their cumulative effect.

As per your new question: digging into the Keras source code, it seems that sample_weights indeed overrides class_weights. Here is the piece of code that does it in the _standardize_weights method (line 499):

if sample_weight is not None:
    # ...does some error handling...
    return sample_weight  # simply returns the weights you passed

elif isinstance(class_weight, dict):
    # ...some error handling and computations...
    # then creates an array repeating the class weight to match your target classes
    weights = np.asarray([class_weight[cls] for cls in y_classes
                          if cls in class_weight])

    # ...more error handling...
    return weights

This means that you can only use one or the other, but not both. Therefore you will indeed need to multiply your sample_weights by the ratio you need to compensate for the imbalance.
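With the asker's numbers, folding the class ratio into the existing per-sample weights could be sketched like this; the resulting new_weights array is what you would then pass as sample_weight to fit() or train_on_batch():

```python
import numpy as np

# The asker's data, from the question
y = np.array([0, 0, 0, 0, 1, 1])
sample_weights = np.array([0.01, 0.03, 0.05, 0.02, 0.01, 0.02])

# Ratio that compensates the imbalance: total class-0 weight / total class-1 weight
ratio = sample_weights[y == 0].sum() / sample_weights[y == 1].sum()  # 0.11 / 0.03
class_weight = {0: 1., 1: ratio}

# Fold the class weight into the per-sample weights
new_weights = sample_weights * np.asarray([class_weight[cls] for cls in y])

# Both classes now contribute the same total weight (0.11 each)
```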

Update: As of the time of this edit (March 27, 2020), looking at the source code of training_utils.standardize_weights(), we can see that it now supports both class_weights and sample_weights:

Everything gets normalized to a single sample-wise (or timestep-wise) weight array. If both sample_weights and class_weights are provided, the weights are multiplied together.
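In recent tf.keras versions, then, both arguments can simply be passed together to fit(), and the effective weight of each sample is the product of the two. A minimal sketch (the toy model and random inputs here are hypothetical, just to illustrate the call):

```python
import numpy as np
import tensorflow as tf

# Toy data matching the question: 4 samples of class 0, 2 of class 1
x = np.random.rand(6, 3).astype("float32")
y = np.array([0, 0, 0, 0, 1, 1])
sample_weights = np.array([0.01, 0.03, 0.05, 0.02, 0.01, 0.02])

model = tf.keras.Sequential([
    tf.keras.Input(shape=(3,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="sgd", loss="binary_crossentropy")

# Both arguments are accepted together; per-sample weight = sample_weight * class_weight
history = model.fit(x, y,
                    sample_weight=sample_weights,
                    class_weight={0: 1., 1: 0.11 / 0.03},
                    epochs=1, verbose=0)
```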

