Custom loss in Keras with softmax to one-hot


Question

I have a model that outputs a Softmax, and I would like to develop a custom loss function. The desired behaviour would be:

1) Softmax to one-hot (normally I do numpy.argmax(softmax_vector) and set that index to 1 in a null vector, but this is not allowed in a loss function).

2) Multiply the resulting one-hot vector by my embedding matrix to get an embedding vector (in my context: the word-vector that is associated to a given word, where words have been tokenized and assigned to indices, or classes for the Softmax output).

3) Compare this vector with the target (this could be a normal Keras loss function).

I know how to write a custom loss function in general, but not to do this. I found this closely related question (unanswered), but my case is a bit different, since I would like to preserve my softmax output.
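
For reference, the desired (non-differentiable) behaviour in plain NumPy would look roughly like the sketch below; embedding_matrix and target are assumed names, not code from the question:

import numpy as np

def desired_loss(softmax_vector, embedding_matrix, target):
    # 1) Hard argmax to one-hot -- the step that is not allowed inside a Keras loss.
    one_hot = np.zeros_like(softmax_vector)
    one_hot[np.argmax(softmax_vector)] = 1.0
    # 2) Multiplying the one-hot by the embedding matrix selects one word vector.
    predicted_vector = one_hot @ embedding_matrix
    # 3) Compare with the target word vector (squared error as an example).
    return np.sum((predicted_vector - target) ** 2)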

Answer

Fan Luo's answer points in the right direction, but ultimately will not work because it involves non-differentiable operations. Note that such operations are acceptable on the ground-truth value (a loss function takes a ground-truth value and a predicted value; non-differentiable operations are only fine on the ground-truth value).

To be fair, that is what I was asking for in the first place. It is not possible to do exactly what I wanted, but we can get a similar, differentiable behaviour:

1) Element-wise power of the softmax values. This makes smaller values much smaller. For example, with a power of 4, [0.5, 0.2, 0.7] becomes [0.0625, 0.0016, 0.2401]. Note that 0.2 is comparable to 0.7, but 0.0016 is negligible with respect to 0.24. The higher my_power is, the closer the final result is to a one-hot vector.

soft_extreme = Lambda(lambda x: x ** my_power)(softmax)

2) Importantly, both the softmax and a one-hot vector are normalized, but our "soft_extreme" is not. First, find the sum of the array:

norm = tf.reduce_sum(soft_extreme, 1, keepdims=True)  # keep shape [batch, 1] so the division below broadcasts

3) Normalize soft_extreme:

almost_one_hot = Lambda(lambda x: x / norm)(soft_extreme)

Note: Setting my_power too high in 1) will result in NaNs. If you need a sharper softmax-to-one-hot conversion, you can run steps 1 to 3 two or more times in a row.
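
A minimal sketch of that repetition, written as plain TensorFlow ops rather than Lambda layers (the loop count of 2 and the names x, softmax and my_power are illustrative assumptions):

import tensorflow as tf

x = softmax                                          # the model's softmax output, shape [batch, dictionary_length]
for _ in range(2):                                   # two sharpen-and-renormalize passes
    x = x ** my_power                                # step 1: element-wise power
    x = x / tf.reduce_sum(x, axis=1, keepdims=True)  # steps 2-3: renormalize so each row sums to 1
almost_one_hot = x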

4) Finally, we want the vector from the dictionary. A lookup is forbidden, but we can take a weighted-average vector using matrix multiplication. Because our almost_one_hot is close to a one-hot encoding, this average will be similar to the vector associated with the highest argument (the originally intended behaviour). The higher my_power is in (1), the truer this will be:

target_vectors = tf.tensordot(almost_one_hot, embedding_matrix, axes=[[1], [0]])

Note: This will not work directly with batches! In my case, I reshaped my "one hot" from [batch, dictionary_length] to [batch, 1, dictionary_length] using tf.reshape, then tiled my embedding_matrix batch times, and finally used:

predicted_vectors = tf.matmul(reshaped_one_hot, tiled_embedding)
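
Put together, a hedged sketch of that batched variant could look like this; batch_size, dictionary_length and embedding_dim are assumed shape variables rather than names from the original code:

import tensorflow as tf

# almost_one_hot: [batch, dictionary_length] -> [batch, 1, dictionary_length]
reshaped_one_hot = tf.reshape(almost_one_hot, [batch_size, 1, dictionary_length])

# embedding_matrix: [dictionary_length, embedding_dim] -> [batch, dictionary_length, embedding_dim]
tiled_embedding = tf.tile(tf.expand_dims(embedding_matrix, 0), [batch_size, 1, 1])

# Batched matmul: [batch, 1, dictionary_length] x [batch, dictionary_length, embedding_dim]
# -> [batch, 1, embedding_dim]
predicted_vectors = tf.matmul(reshaped_one_hot, tiled_embedding)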

There may be more elegant (or less memory-hungry, if tiling the embedding matrix is not an option) solutions, so feel free to explore further.
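
As a closing illustration, one way the whole recipe could be wrapped into a single Keras-compatible loss is sketched below. This is only a sketch under assumptions (a fixed embedding_matrix of shape [dictionary_length, embedding_dim], mean squared error as the final comparison, and illustrative names), not the author's exact code:

import tensorflow as tf

def embedding_loss(embedding_matrix, my_power=4):
    def loss(y_true_vectors, softmax_pred):
        # 1) Sharpen: the element-wise power pushes the softmax towards one-hot.
        soft_extreme = softmax_pred ** my_power
        # 2-3) Renormalize so each row sums to 1 again.
        almost_one_hot = soft_extreme / tf.reduce_sum(soft_extreme, axis=1, keepdims=True)
        # 4) Soft lookup: a weighted average of the embedding rows, which approaches
        #    the row of the argmax class as my_power grows.
        predicted_vectors = tf.tensordot(almost_one_hot, embedding_matrix, axes=[[1], [0]])
        # Compare the predicted embedding with the target embedding (MSE here).
        return tf.reduce_mean(tf.square(predicted_vectors - y_true_vectors))
    return loss

It could then be passed to model.compile(loss=embedding_loss(embedding_matrix)), assuming the model's targets are the word vectors themselves.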
