Using sample_weight in Keras for sequence labelling


Question

I am working on a sequential labeling problem with unbalanced classes and I would like to use sample_weight to resolve the unbalance issue. Basically if I train the model for about 10 epochs, I get great results. If I train for more epochs, val_loss keeps dropping, but I get worse results. I'm guessing the model just detects more of the dominant class to the detriment of the smaller classes.

The model has two inputs, for word embeddings and character embeddings, and the input is one of 7 possible classes from 0 to 6.

With the padding, the shape of my input layer for word embeddings is (3000, 150) and the input layer for char embeddings is (3000, 150, 15). I use a 0.3 split for testing and training data, which means X_train for word embeddings is (2000, 150) and (2000, 150, 15) for char embeddings. y contains the correct class for each word, encoded in a one-hot vector of dimension 7, so its shape is (3000, 150, 7). y is likewise split into a training and testing set. Each input is then fed into a Bidirectional LSTM.

The output is a matrix with one of the 7 categories assigned for each word of the 2000 training samples, so the size is (2000, 150, 7).

At first, I simply tried to define sample_weight as an np.array of length 7 containing the weights for each class:

from collections import Counter
import numpy as np

# Count how often each class appears across all timesteps of y.
count = dict(Counter(list(array).index(1) for sample in y for array in sample))
count[0] = 0  # zero out the padding class
total = sum(count.values())
count = {key: count[key] / total for key in count}
category_weights = np.zeros(7)
for f in count:
    category_weights[f] = count[f]

But I got the following error: ValueError: Found a sample_weight array with shape (7,) for an input with shape (2000, 150, 7). sample_weight cannot be broadcast.

Looking at the docs, it looks like I should instead be passing a 2D array with shape (samples, sequence_length). So I create a (3000, 150) array with a concatenation of the weights of every word of each sequence:

# frequency holds the per-class weights computed above
# (called category_weights there)
weights = []

for sample in y:
    current_weight = []
    for line in sample:
        # look up this word's class weight from its one-hot row
        current_weight.append(frequency[list(line).index(1)])
    weights.append(current_weight)

weights = np.array(weights)
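The nested loop above can also be written as a single vectorized lookup. A minimal sketch with toy stand-ins for the real data, assuming frequency is a length-7 array of per-class weights:

```python
import numpy as np

# Toy stand-ins: 4 samples, 5 timesteps, 7 classes.
rng = np.random.default_rng(0)
labels = rng.integers(0, 7, size=(4, 5))   # class index per word
y = np.eye(7)[labels]                      # one-hot, shape (4, 5, 7)

frequency = np.linspace(1.0, 4.0, 7)       # hypothetical per-class weights

# argmax recovers the class index from each one-hot row,
# then fancy indexing looks up the weight per timestep.
weights = frequency[y.argmax(axis=-1)]     # shape (4, 5)
```

This produces the same (samples, sequence_length) array as the loop, just without the Python-level iteration.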

and pass that to the fit function through the sample_weight parameter after having added the sample_weight_mode="temporal" option in compile().

I first got an error telling me the dimension was wrong, however after generating the weights for only the training sample, I end up with a (2000, 150) array that I can use to fit my model.

  • Is this the correct way to define sample_weights, or am I doing everything wrong? I can't say I noticed any improvement after adding the weights, so I must have missed something.

Answer

I think you are confusing sample_weights and class_weights. Checking the docs a bit we can see the differences between them:

sample_weights is used to provide a weight for each training sample. That means that you should pass a 1D array with the same number of elements as your training samples (indicating the weight for each of those samples). In case you are using temporal data you may instead pass a 2D array, enabling you to give weight to each timestep of each sample.

class_weights is used to provide a weight or bias for each output class. This means you should pass a weight for each class that you are trying to classify. Furthermore, this parameter expects a dictionary to be passed to it (not an array, that is why you got that error). For example consider this situation:

class_weight = {0: 1., 1: 50.}

In this case (a binary classification problem) you are giving 50 times as much weight (or "relevance") to your samples of class 1 compared to class 0. This way you can compensate for imbalanced datasets. Here is another useful post explaining more about this and other options to consider when dealing with imbalanced datasets.
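Conceptually, this amounts to multiplying each sample's loss by the weight of its class. A toy numpy illustration (not Keras' actual implementation; the losses below are made up):

```python
import numpy as np

class_weight = {0: 1., 1: 50.}

y_true = np.array([0, 1, 1, 0])                    # labels for four samples
per_sample_loss = np.array([0.2, 0.5, 0.1, 0.4])   # hypothetical unweighted losses

# Look up each sample's class weight and scale its loss contribution.
multipliers = np.array([class_weight[c] for c in y_true])
weighted_loss = (per_sample_loss * multipliers).mean()
# Errors on class 1 now dominate the average, pushing the model
# to pay attention to the rarer class.
```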

If I train for more epochs, val_loss keeps dropping, but I get worse results.

Probably you are over-fitting, and something that may be contributing to that is the imbalanced classes your dataset has, as you correctly suspected. Compensating the class weights should help mitigate this, however there may still be other factors that can cause over-fitting that escape the scope of this question/answer (so make sure to watch out for those after solving this question).

Judging by your post, seems to me that what you need is to use class_weight to balance your dataset for training, for which you will need to pass a dictionary indicating the weight ratios between your 7 classes. Consider using sample_weight only if you want to give each sample a custom weight for consideration.
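A common heuristic for building such a dictionary is inverse class frequency (my suggestion, not something from your post). A sketch, assuming y is the one-hot (samples, timesteps, 7) label tensor from the question:

```python
import numpy as np

# Toy one-hot labels standing in for the real (3000, 150, 7) tensor.
rng = np.random.default_rng(1)
y = np.eye(7)[rng.integers(0, 7, size=(100, 10))]

counts = y.sum(axis=(0, 1))  # occurrences of each class
# Inverse-frequency weights: rare classes get weights > 1,
# frequent classes < 1, so every class contributes equally.
class_weight = {i: counts.sum() / (7 * c) for i, c in enumerate(counts)}
```

The resulting dictionary can then be passed as the class_weight argument of fit().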

If you want a more detailed comparison between those two consider checking this answer I posted on a related question. Spoiler: sample_weight overrides class_weight, so you have to use one or the other, but not both, so be careful with not mixing them.

Update: As of the moment of this edit (March 27, 2020), looking at the source code of training_utils.standardize_weights() we can see that it now supports both class_weights and sample_weights:

Everything gets normalized to a single sample-wise (or timestep-wise) weight array. If both sample_weights and class_weights are provided, the weights are multiplied together.
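In that version the two arrays end up combined element-wise. A toy sketch of the resulting effective weights (not the actual standardize_weights() code, just the multiplication it describes):

```python
import numpy as np

sample_weights = np.array([[1.0, 0.5],
                           [2.0, 1.0]])     # (samples, timesteps)
labels = np.array([[0, 1],
                   [1, 0]])                 # class per timestep
class_weight = {0: 1.0, 1: 50.0}

# Map each label to its class weight, then multiply with the
# per-timestep sample weights to get the final loss scaling.
class_multiplier = np.vectorize(class_weight.get)(labels)
effective = sample_weights * class_multiplier
```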
