How is total loss calculated over multiple classes in Keras?

Question

Let's say I have a network with the following params:

  1. Fully convolutional network for semantic segmentation
  2. Loss = weighted binary cross-entropy (but it could be any loss function, it doesn't matter)
  3. 5 classes - the input is an image and the ground truths are binary masks
  4. Batch size = 16

Now, I know that the loss is calculated in the following manner: binary cross-entropy is applied to each pixel in the image with regard to each class. So essentially, each pixel will have 5 loss values.

What happens after this step?

When I train my network, it prints only a single loss value per epoch. There are many levels of loss accumulation that need to happen to produce a single value, and how it happens is not clear at all in the docs/code.

  1. What gets combined first - (1) the loss values for each class (for instance, the 5 values, one per class, get combined per pixel) and then all the pixels in the image, or (2) all the pixels in the image for each individual class, and then all the class losses are combined?
  2. How exactly do these different pixel combinations happen - where is it summed / where is it averaged?
  3. Keras's binary_crossentropy averages over axis=-1. So is this an average over all the pixels per class, an average over all the classes, or both?

To state it in a different way: how are the losses for different classes combined to produce a single loss value for an image?

This is not explained in the docs at all and would be very helpful for people doing multi-class predictions in Keras, regardless of the type of network. Here is the link to the start of the Keras code where one first passes in the loss function.

The closest thing I could find to an explanation is:

loss: String (name of objective function) or objective function. See losses. If the model has multiple outputs, you can use a different loss on each output by passing a dictionary or a list of losses. The loss value that will be minimized by the model will then be the sum of all individual losses

This is from the Keras docs. So does this mean that the losses for each class in the image are simply summed?

Here is example code for someone to try out. It's a basic implementation borrowed from Kaggle and modified for multi-label prediction:

# Imports assumed by this snippet (standard Keras 2.x / TF 1.x style)
from keras.models import Model
from keras.layers import Input, Lambda, Conv2D, MaxPooling2D, Conv2DTranspose, concatenate
from keras import backend as K
import tensorflow as tf

# Build U-Net model
num_classes = 5
IMG_DIM = 256
IMG_CHAN = 3
weights = {0: 1, 1: 1, 2: 1, 3: 1, 4: 1000} #chose an extreme value just to check for any reaction
inputs = Input((IMG_DIM, IMG_DIM, IMG_CHAN))
s = Lambda(lambda x: x / 255) (inputs)

c1 = Conv2D(8, (3, 3), activation='relu', padding='same') (s)
c1 = Conv2D(8, (3, 3), activation='relu', padding='same') (c1)
p1 = MaxPooling2D((2, 2)) (c1)

c2 = Conv2D(16, (3, 3), activation='relu', padding='same') (p1)
c2 = Conv2D(16, (3, 3), activation='relu', padding='same') (c2)
p2 = MaxPooling2D((2, 2)) (c2)

c3 = Conv2D(32, (3, 3), activation='relu', padding='same') (p2)
c3 = Conv2D(32, (3, 3), activation='relu', padding='same') (c3)
p3 = MaxPooling2D((2, 2)) (c3)

c4 = Conv2D(64, (3, 3), activation='relu', padding='same') (p3)
c4 = Conv2D(64, (3, 3), activation='relu', padding='same') (c4)
p4 = MaxPooling2D(pool_size=(2, 2)) (c4)

c5 = Conv2D(128, (3, 3), activation='relu', padding='same') (p4)
c5 = Conv2D(128, (3, 3), activation='relu', padding='same') (c5)

u6 = Conv2DTranspose(64, (2, 2), strides=(2, 2), padding='same') (c5)
u6 = concatenate([u6, c4])
c6 = Conv2D(64, (3, 3), activation='relu', padding='same') (u6)
c6 = Conv2D(64, (3, 3), activation='relu', padding='same') (c6)

u7 = Conv2DTranspose(32, (2, 2), strides=(2, 2), padding='same') (c6)
u7 = concatenate([u7, c3])
c7 = Conv2D(32, (3, 3), activation='relu', padding='same') (u7)
c7 = Conv2D(32, (3, 3), activation='relu', padding='same') (c7)

u8 = Conv2DTranspose(16, (2, 2), strides=(2, 2), padding='same') (c7)
u8 = concatenate([u8, c2])
c8 = Conv2D(16, (3, 3), activation='relu', padding='same') (u8)
c8 = Conv2D(16, (3, 3), activation='relu', padding='same') (c8)

u9 = Conv2DTranspose(8, (2, 2), strides=(2, 2), padding='same') (c8)
u9 = concatenate([u9, c1], axis=3)
c9 = Conv2D(8, (3, 3), activation='relu', padding='same') (u9)
c9 = Conv2D(8, (3, 3), activation='relu', padding='same') (c9)

outputs = Conv2D(num_classes, (1, 1), activation='sigmoid') (c9)

# NOTE: BCE_loss, dice_coef and mean_iou come from the linked Kaggle kernel
# and are assumed to be defined before this point.
def weighted_loss(weightsList):
    def lossFunc(true, pred):
        axis = -1  # if channels last
        # axis = 1  # if channels first

        # Build a per-pixel weight map: each pixel gets the weight of its ground-truth class.
        classSelectors = K.argmax(true, axis=axis)
        classSelectors = [K.equal(tf.cast(i, tf.int64), tf.cast(classSelectors, tf.int64))
                          for i in range(len(weightsList))]
        classSelectors = [K.cast(x, K.floatx()) for x in classSelectors]
        weights = [sel * w for sel, w in zip(classSelectors, weightsList)]

        weightMultiplier = weights[0]
        for i in range(1, len(weights)):
            weightMultiplier = weightMultiplier + weights[i]

        # Weighted BCE-Dice loss for this sample
        loss = BCE_loss(true, pred) - (1 + dice_coef(true, pred))
        loss = loss * weightMultiplier
        return loss
    return lossFunc

model = Model(inputs=[inputs], outputs=[outputs])
model.compile(optimizer='adam', loss=weighted_loss(weights), metrics=[mean_iou])
model.summary()

The actual BCE-DICE loss function can be found here.
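Since the snippet above calls BCE_loss, dice_coef and mean_iou without showing them, here is a minimal sketch of what such Keras-backend helpers commonly look like (mean_iou is omitted). These are assumptions for readability, not the exact definitions from the linked kernel:

from keras import backend as K

def dice_coef(y_true, y_pred, smooth=1.0):
    # Dice coefficient computed over the flattened prediction/ground-truth tensors.
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    return (2. * intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)

def BCE_loss(y_true, y_pred):
    # Plain binary cross-entropy, averaged over the last (class) axis.
    return K.mean(K.binary_crossentropy(y_true, y_pred), axis=-1)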

Motivation for the question: based on the above code, the total validation loss of the network after 20 epochs is ~1%; however, the mean intersection over union scores for the first 4 classes are above 95% each, while for the last class it is 23%, clearly indicating that the 5th class isn't doing well at all. However, this drop in accuracy isn't reflected in the loss at all. Hence, the individual losses for a sample must be combined in a way that completely negates the huge loss we see for the 5th class, and when the per-sample losses are combined over the batch, the result is still really low. I'm not sure how to reconcile this information.

Answer

Although I have already mentioned part of this answer in a related answer, let's inspect the source code step by step in more detail to find the answer concretely.

First, let's feedforward(!): there is a call to the weighted_loss function, which takes y_true, y_pred, sample_weight and mask as inputs:

weighted_loss = weighted_losses[i]
# ...
output_loss = weighted_loss(y_true, y_pred, sample_weight, mask)

weighted_loss is actually an element of a list which contains all the (augmented) loss functions passed to the fit method:

weighted_losses = [
    weighted_masked_objective(fn) for fn in loss_functions]

The word "augmented" I used is important here. That's because, as you can see above, the actual loss function is wrapped by another function called weighted_masked_objective, which is defined as follows:

def weighted_masked_objective(fn):
    """Adds support for masking and sample-weighting to an objective function.
    It transforms an objective function `fn(y_true, y_pred)`
    into a sample-weighted, cost-masked objective function
    `fn(y_true, y_pred, weights, mask)`.
    # Arguments
        fn: The objective function to wrap,
            with signature `fn(y_true, y_pred)`.
    # Returns
        A function with signature `fn(y_true, y_pred, weights, mask)`.
    """
    if fn is None:
        return None

    def weighted(y_true, y_pred, weights, mask=None):
        """Wrapper function.
        # Arguments
            y_true: `y_true` argument of `fn`.
            y_pred: `y_pred` argument of `fn`.
            weights: Weights tensor.
            mask: Mask tensor.
        # Returns
            Scalar tensor.
        """
        # score_array has ndim >= 2
        score_array = fn(y_true, y_pred)
        if mask is not None:
            # Cast the mask to floatX to avoid float64 upcasting in Theano
            mask = K.cast(mask, K.floatx())
            # mask should have the same shape as score_array
            score_array *= mask
            #  the loss per batch should be proportional
            #  to the number of unmasked samples.
            score_array /= K.mean(mask)

        # apply sample weighting
        if weights is not None:
            # reduce score_array to same ndim as weight array
            ndim = K.ndim(score_array)
            weight_ndim = K.ndim(weights)
            score_array = K.mean(score_array,
                                 axis=list(range(weight_ndim, ndim)))
            score_array *= weights
            score_array /= K.mean(K.cast(K.not_equal(weights, 0), K.floatx()))
        return K.mean(score_array)
    return weighted

So there is a nested function, weighted, that actually calls the real loss function fn in the line score_array = fn(y_true, y_pred). Now, to be concrete, in the case of the example the OP provided, fn (i.e. the loss function) is binary_crossentropy. Therefore, we need to take a look at the definition of binary_crossentropy() in Keras:

def binary_crossentropy(y_true, y_pred):
    return K.mean(K.binary_crossentropy(y_true, y_pred), axis=-1)

which in turn calls the backend function K.binary_crossentropy(). When using TensorFlow as the backend, K.binary_crossentropy() is defined as follows:

def binary_crossentropy(target, output, from_logits=False):
    """Binary crossentropy between an output tensor and a target tensor.
    # Arguments
        target: A tensor with the same shape as `output`.
        output: A tensor.
        from_logits: Whether `output` is expected to be a logits tensor.
            By default, we consider that `output`
            encodes a probability distribution.
    # Returns
        A tensor.
    """
    # Note: tf.nn.sigmoid_cross_entropy_with_logits
    # expects logits, Keras expects probabilities.
    if not from_logits:
        # transform back to logits
        _epsilon = _to_tensor(epsilon(), output.dtype.base_dtype)
        output = tf.clip_by_value(output, _epsilon, 1 - _epsilon)
        output = tf.log(output / (1 - output))

    return tf.nn.sigmoid_cross_entropy_with_logits(labels=target,
                                                   logits=output)

tf.nn.sigmoid_cross_entropy_with_logits returns:

A Tensor of the same shape as logits with the componentwise logistic losses.
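As a quick sanity check of that statement, the following sketch (with made-up, smaller shapes in the same (batch, H, W, num_classes) layout as the question) shows that the returned tensor keeps one loss value per pixel and per class, with nothing reduced yet:

import numpy as np
import tensorflow as tf

labels = tf.constant(np.random.randint(0, 2, size=(2, 8, 8, 5)).astype('float32'))
logits = tf.constant(np.random.randn(2, 8, 8, 5).astype('float32'))

# Componentwise logistic losses: same shape as the logits.
elementwise = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)
print(elementwise.shape)  # (2, 8, 8, 5)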

Now, let's backpropagate(!): considering the above note, the output shape of K.binary_crossentropy will be the same as that of y_pred (or y_true). As the OP mentioned, y_true has a shape of (batch_size, img_dim, img_dim, num_classes). Therefore, the K.mean(..., axis=-1) is applied over a tensor of shape (batch_size, img_dim, img_dim, num_classes), which results in an output tensor of shape (batch_size, img_dim, img_dim). So the loss values of all classes are averaged for each pixel in the image. Hence, the shape of score_array in the weighted function mentioned above will be (batch_size, img_dim, img_dim). There is one more step: the return statement in the weighted function takes the mean again, i.e. return K.mean(score_array). So how does it compute the mean? If you take a look at the definition of the mean backend function, you will find that the axis argument is None by default:

def mean(x, axis=None, keepdims=False):
    """Mean of a tensor, alongside the specified axis.
    # Arguments
        x: A tensor or variable.
        axis: A list of integer. Axes to compute the mean.
        keepdims: A boolean, whether to keep the dimensions or not.
            If `keepdims` is `False`, the rank of the tensor is reduced
            by 1 for each entry in `axis`. If `keepdims` is `True`,
            the reduced dimensions are retained with length 1.
    # Returns
        A tensor with the mean of elements of `x`.
    """
    if x.dtype.base_dtype == tf.bool:
        x = tf.cast(x, floatx())
    return tf.reduce_mean(x, axis, keepdims)

And it calls tf.reduce_mean(), which, given axis=None, takes the mean over all the axes of the input tensor and returns one single value. Therefore, the mean of the whole tensor of shape (batch_size, img_dim, img_dim) is computed, which translates to taking the average over all the labels in the batch and over all their pixels, and it is returned as one single scalar value that represents the loss value. Then this loss value is reported back by Keras and is used for optimization.
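To make that chain of reductions concrete, here is a small NumPy sketch (shapes assumed from the question) that mimics what Keras does: an axis=-1 mean inside binary_crossentropy, followed by a global mean in the weighted wrapper. It also illustrates why, with plain binary_crossentropy, one badly-performing class gets diluted in the final scalar:

import numpy as np

batch, H, W, C = 16, 256, 256, 5
# Stand-in for the per-pixel, per-class output of K.binary_crossentropy
per_pixel_per_class = np.full((batch, H, W, C), 0.01)
per_pixel_per_class[..., 4] = 1.0   # pretend class 5 has a huge loss everywhere

per_pixel = per_pixel_per_class.mean(axis=-1)   # binary_crossentropy: shape (batch, H, W)
scalar_loss = per_pixel.mean()                  # weighted wrapper's K.mean(): one scalar

print(scalar_loss)  # ~0.208: class 5's loss is averaged down by the other 4 classes
assert np.isclose(scalar_loss, per_pixel_per_class.mean())  # equivalent to one global mean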

Bonus: what if our model has multiple output layers and therefore multiple loss functions are used?

Remember the first piece of code I mentioned in this answer:

weighted_loss = weighted_losses[i]
# ...
output_loss = weighted_loss(y_true, y_pred, sample_weight, mask)

As you can see, there is an i variable which is used for indexing the list. You may have guessed correctly: it is actually part of a loop which computes the loss value for each output layer using its designated loss function, and then takes the (weighted) sum of all these loss values to compute the total loss:

# Compute total loss.
total_loss = None
with K.name_scope('loss'):
    for i in range(len(self.outputs)):
        if i in skip_target_indices:
            continue
        y_true = self.targets[i]
        y_pred = self.outputs[i]
        weighted_loss = weighted_losses[i]
        sample_weight = sample_weights[i]
        mask = masks[i]
        loss_weight = loss_weights_list[i]
        with K.name_scope(self.output_names[i] + '_loss'):
            output_loss = weighted_loss(y_true, y_pred,
                                        sample_weight, mask)
        if len(self.outputs) > 1:
            self.metrics_tensors.append(output_loss)
            self.metrics_names.append(self.output_names[i] + '_loss')
        if total_loss is None:
            total_loss = loss_weight * output_loss
        else:
            total_loss += loss_weight * output_loss
    if total_loss is None:
        if not self.losses:
            raise ValueError('The model cannot be compiled '
                                'because it has no loss to optimize.')
        else:
            total_loss = 0.

    # Add regularization penalties
    # and other layer-specific losses.
    for loss_tensor in self.losses:
        total_loss += loss_tensor  
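
For illustration, here is a hedged sketch of how those per-output loss weights enter the compile call, using a hypothetical tiny two-output model (not the OP's U-Net); the total loss Keras optimizes is then the weighted sum computed by the loop above, plus any terms collected in model.losses:

from keras.models import Model
from keras.layers import Input, Dense

inp = Input((10,))
out_a = Dense(1, activation='sigmoid', name='out_a')(inp)
out_b = Dense(3, name='out_b')(inp)
multi_output_model = Model(inputs=inp, outputs=[out_a, out_b])

# total_loss = 1.0 * loss(out_a) + 0.2 * loss(out_b)  (+ regularization losses, if any)
multi_output_model.compile(optimizer='adam',
                           loss={'out_a': 'binary_crossentropy', 'out_b': 'mse'},
                           loss_weights={'out_a': 1.0, 'out_b': 0.2})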
