What are the differences between all these cross-entropy losses in Keras and TensorFlow?


Question

What are the differences between all these cross-entropy losses?

Keras is talking about

  • Binary cross-entropy
  • Categorical cross-entropy
  • Sparse categorical cross-entropy

While TensorFlow has

  • Softmax cross-entropy with logits
  • Sparse softmax cross-entropy with logits
  • Sigmoid cross-entropy with logits

What are the differences and relationships between them? What are the typical applications for them? What's the mathematical background? Are there other cross-entropy types that one should know? Are there any cross-entropy types without logits?

Answer

There is just one cross (Shannon) entropy, defined as:

H(P||Q) = - SUM_i P(X=i) log Q(X=i)

In machine learning usage, P is the actual (ground truth) distribution and Q is the predicted distribution. All the functions you listed are just helper functions that accept different ways of representing P and Q.
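
As a minimal sketch of this definition (plain TensorFlow; the variable names and values are mine), the cross entropy of a predicted distribution Q against a ground-truth distribution P can be computed directly:

import tensorflow as tf

# Ground truth P and prediction Q over 3 classes (both are proper distributions).
P = tf.constant([0.0, 1.0, 0.0])    # hard target: the true class is index 1
Q = tf.constant([0.1, 0.7, 0.2])    # predicted probabilities

# H(P||Q) = - SUM_i P(X=i) * log Q(X=i)
cross_entropy = -tf.reduce_sum(P * tf.math.log(Q))
print(cross_entropy.numpy())        # ~0.357, i.e. -log(0.7)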

There are basically 3 main things to consider (a short sketch after the list illustrates all three):

  • there are either 2 possible outcomes (binary classification) or more. If there are just two outcomes, then Q(X=1) = 1 - Q(X=0), so a single float in (0,1) identifies the whole distribution; this is why a neural network for binary classification has a single output (and so does logistic regression). If there are K>2 possible outcomes, one has to define K outputs (one per each Q(X=...))

  • one either produces proper probabilities (meaning that Q(X=i) >= 0 and SUM_i Q(X=i) = 1), or one just produces a "score" and has some fixed method of transforming the score into a probability. For example, a single real number can be "transformed into a probability" by taking its sigmoid, and a set of real numbers can be transformed by taking their softmax, and so on.

  • there is j such that P(X=j) = 1 (there is one "true class", targets are "hard", like "this image represents a cat"), or there are "soft targets" (like "we are 60% sure this is a cat, but for 40% it is actually a dog").
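
A short sketch of these three points (TensorFlow; the numbers are only illustrative): a single number in (0,1) fixes the whole binary distribution, sigmoid/softmax turn scores into probabilities, and hard targets are just a degenerate case of soft ones:

import tensorflow as tf

# (1) Binary case: one probability determines the whole distribution.
q1 = 0.8                                     # Q(X=1)
q0 = 1.0 - q1                                # Q(X=0) follows automatically

# (2) Scores vs probabilities: raw scores ("logits") are mapped to
#     probabilities by sigmoid (2 outcomes) or softmax (K outcomes).
prob  = tf.math.sigmoid(tf.constant(1.2))               # single score -> probability
probs = tf.nn.softmax(tf.constant([2.0, 0.5, -1.0]))    # K scores -> K probabilities summing to 1

# (3) Hard vs soft targets over K=3 classes.
hard_target = tf.constant([0.0, 1.0, 0.0])   # "this is class 1" (or simply the index 1)
soft_target = tf.constant([0.4, 0.6, 0.0])   # "60% cat, 40% dog"
print(prob.numpy(), probs.numpy())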

Depending on these three aspects, a different helper function should be used:

                                  outcomes     what is in Q    targets in P   
-------------------------------------------------------------------------------
binary CE                                2      probability         any
categorical CE                          >2      probability         soft
sparse categorical CE                   >2      probability         hard
sigmoid CE with logits                   2      score               any
softmax CE with logits                  >2      score               soft
sparse softmax CE with logits           >2      score               hard
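
Assuming that mapping, the corresponding calls look roughly like this (a sketch only; note that the Keras losses can also take raw scores if you pass from_logits=True):

import tensorflow as tf

y_true_onehot = [[0.0, 1.0, 0.0]]                # soft/one-hot target, K=3
y_true_index  = [1]                              # hard target as a class index
y_prob   = tf.constant([[0.1, 0.7, 0.2]])        # predicted probabilities
y_logits = tf.constant([[2.0, 0.5, -1.0]])       # raw scores ("logits")

# Keras helpers (expect probabilities by default).
print(tf.keras.losses.BinaryCrossentropy()([[1.0]], [[0.8]]).numpy())
print(tf.keras.losses.CategoricalCrossentropy()(y_true_onehot, y_prob).numpy())
print(tf.keras.losses.SparseCategoricalCrossentropy()(y_true_index, y_prob).numpy())

# TensorFlow "with logits" helpers (expect raw scores).
print(tf.nn.sigmoid_cross_entropy_with_logits(labels=[[1.0]], logits=[[1.2]]).numpy())
print(tf.nn.softmax_cross_entropy_with_logits(labels=y_true_onehot, logits=y_logits).numpy())
print(tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y_true_index, logits=y_logits).numpy())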

In the end one could just use "categorical cross entropy", as this is how it is mathematically defined; however, since things like hard targets or binary classification are very popular, modern ML libraries do provide these additional helper functions to make things simpler. In particular, "stacking" sigmoid and cross entropy might be numerically unstable, but if one knows these two operations are applied together, there is a numerically stable version of them combined (which is implemented in TF).
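For example, with plain TF the naive "stack" can overflow to infinity where the fused helper stays finite (a sketch; the extreme score is chosen deliberately):

import tensorflow as tf

logits = tf.constant([-200.0])    # extreme score: confidently predicts class 0
labels = tf.constant([1.0])       # ...but the true class is 1

# Naive stacking: sigmoid first, then the cross-entropy formula by hand.
p = tf.math.sigmoid(logits)                           # underflows to 0.0 in float32
naive = -(labels * tf.math.log(p)
          + (1.0 - labels) * tf.math.log(1.0 - p))    # log(0) -> inf

# Fused, numerically stable version implemented in TF.
stable = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)

print(naive.numpy(), stable.numpy())                  # [inf] vs [200.]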

It is important to notice that if you apply the wrong helper function the code will usually still execute, but the results will be wrong. For example, if you apply a softmax_* helper to binary classification with one output, your network will be considered to always produce "True" at the output.
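A sketch of that failure mode: softmax over a single output is always 1, so the helper sees a "distribution" that always assigns full probability to the only class, and the loss gives no useful signal no matter what the network outputs:

import tensorflow as tf

single_logit = tf.constant([[3.7], [-5.2]])    # one output unit, two examples
print(tf.nn.softmax(single_logit).numpy())     # [[1.], [1.]] -- always the same "prediction"

# With the only valid class index (0), the loss is exactly 0 regardless of the logits.
loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=[0, 0], logits=single_logit)
print(loss.numpy())                            # [0., 0.]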

As a final note, this answer considers classification; things are slightly different in the multi-label case (when a single point can have multiple labels), as then the P values do not sum to 1, and one should use sigmoid_cross_entropy_with_logits despite having multiple output units.
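A sketch of the multi-label case (labels chosen arbitrarily): each output unit gets its own independent sigmoid cross entropy, so one example can carry several labels at once and the targets need not sum to 1:

import tensorflow as tf

# One example, 4 possible labels; it carries labels 0 and 2 simultaneously.
labels = tf.constant([[1.0, 0.0, 1.0, 0.0]])
logits = tf.constant([[2.1, -1.3, 0.4, -0.7]])   # one raw score per label

# Independent sigmoid cross entropy per output unit...
per_label = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)
# ...usually averaged (or summed) over labels to get the per-example loss.
loss = tf.reduce_mean(per_label, axis=-1)
print(per_label.numpy(), loss.numpy())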

