Cost function training target versus accuracy desired goal

Question

When we train neural networks, we typically use gradient descent, which relies on a continuous, differentiable, real-valued cost function. The final cost function might, for example, be the mean squared error. Put another way, gradient descent implicitly assumes the end goal is regression - to minimize a real-valued error measure.

Sometimes what we want a neural network to do is perform classification - given an input, classify it into two or more discrete categories. In this case, the end goal the user cares about is classification accuracy - the percentage of cases classified correctly.

But when we are using a neural network for classification, though our goal is classification accuracy, that is not what the neural network is trying to optimize. The neural network is still trying to optimize the real-valued cost function. Sometimes these point in the same direction, but sometimes they don't. In particular, I've been running into cases where a neural network trained to correctly minimize the cost function has a classification accuracy worse than a simple hand-coded threshold comparison.

I've boiled this down to a minimal test case using TensorFlow. It sets up a perceptron (neural network with no hidden layers), trains it on an absolutely minimal dataset (one input variable, one binary output variable), assesses the classification accuracy of the result, then compares it to the classification accuracy of a simple hand-coded threshold comparison; the results are 60% and 80% respectively. Intuitively, this is because a single outlier with a large input value generates a correspondingly large output value, so the way to minimize the cost function is to try extra hard to accommodate that one case, in the process misclassifying two more ordinary cases. The perceptron is correctly doing what it was told to do; it's just that this does not match what we actually want of a classifier. But the classification accuracy is not a continuous differentiable function, so we can't use it as the target for gradient descent.

How can we train a neural network so that it ends up maximizing classification accuracy?

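# Note: this script uses the TensorFlow 1.x graph API
# (tf.InteractiveSession, tf.placeholder, tf.train.*).
import numpy as np
import tensorflow as tf
sess = tf.InteractiveSession()
tf.set_random_seed(1)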

# Parameters
epochs = 10000
learning_rate = 0.01

# Data
train_X = [
    [0],
    [0],
    [2],
    [2],
    [9],
]
train_Y = [
    0,
    0,
    1,
    1,
    0,
]

rows = np.shape(train_X)[0]
cols = np.shape(train_X)[1]

# Inputs and outputs
X = tf.placeholder(tf.float32)
Y = tf.placeholder(tf.float32)

# Weights
W = tf.Variable(tf.random_normal([cols]))
b = tf.Variable(tf.random_normal([]))

# Model
pred = tf.tensordot(X, W, 1) + b
cost = tf.reduce_sum((pred-Y)**2/rows)
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
tf.global_variables_initializer().run()

# Train
for epoch in range(epochs):
    # Print update at successive doublings of time
    if epoch&(epoch-1) == 0 or epoch == epochs-1:
        print('{} {} {} {}'.format(
            epoch,
            cost.eval({X: train_X, Y: train_Y}),
            W.eval(),
            b.eval(),
            ))
    optimizer.run({X: train_X, Y: train_Y})

# Classification accuracy of perceptron
classifications = [pred.eval({X: x}) > 0.5 for x in train_X]
correct = sum([p == y for (p, y) in zip(classifications, train_Y)])
print('{}/{} = perceptron accuracy'.format(correct, rows))

# Classification accuracy of hand-coded threshold comparison
classifications = [x[0] > 1.0 for x in train_X]
correct = sum([p == y for (p, y) in zip(classifications, train_Y)])
print('{}/{} = threshold accuracy'.format(correct, rows))
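
As a cross-check that does not need TensorFlow, here is a minimal NumPy sketch, assuming the perceptron above converges to the ordinary least-squares fit (which is what minimizing the mean squared error of a linear model amounts to). It solves the same problem in closed form and compares both classifiers on the five training points; the variable names are mine, not part of the original script.

```python
import numpy as np

# Same toy dataset as in the TensorFlow script above.
x = np.array([0.0, 0.0, 2.0, 2.0, 9.0])
y = np.array([0.0, 0.0, 1.0, 1.0, 0.0])

# Closed-form least-squares fit of pred = w * x + b
# (assumption: gradient descent on the MSE converges to this solution).
A = np.column_stack([x, np.ones_like(x)])
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)

pred = w * x + b

# Number of correct classifications when thresholding the fitted output at 0.5.
lsq_correct = int(np.sum((pred > 0.5) == (y == 1)))

# Number of correct classifications for the hand-coded rule x > 1.0.
thr_correct = int(np.sum((x > 1.0) == (y == 1)))

print('w = {:.4f}, b = {:.4f}'.format(w, b))
print('{}/{} = least-squares accuracy'.format(lsq_correct, len(x)))
print('{}/{} = threshold accuracy'.format(thr_correct, len(x)))
```

The fitted slope comes out slightly negative because of the outlier at x = 9, so every prediction falls below 0.5 and all five points are assigned class 0, which is where the 60% figure comes from.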

Recommended answer

How can we train a neural network so that it ends up maximizing classification accuracy?

I'm asking for a way to get a continuous proxy function that's closer to the accuracy

To start with, the loss function used today for classification tasks in (deep) neural nets was not invented with them; it goes back several decades and actually comes from the early days of logistic regression. Here is the equation for the simple case of binary classification:
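
The equation itself did not survive in this copy. Since the text describes the loss inherited from logistic regression, it is presumably the binary cross-entropy (log loss); the reconstruction below uses my own notation, with m training samples, labels y_i in {0, 1}, and predicted probabilities p_i for class 1.

```latex
\[
J = -\frac{1}{m} \sum_{i=1}^{m}
    \Bigl[\, y_i \log p_i + (1 - y_i) \log (1 - p_i) \,\Bigr]
\]
```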

The idea behind it was exactly to come up with a continuous & differentiable function, so that we would be able to exploit the (vast, and still expanding) arsenal of convex optimization for classification problems.

It is safe to say that the above loss function is the best we have so far, given the desired mathematical constraints mentioned above.
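
To make the contrast with accuracy concrete, here is a small NumPy sketch of the binary cross-entropy (assuming that is the loss meant above). The function names and the two sets of predicted probabilities are made up for illustration: both sets classify every sample correctly at a 0.5 threshold, yet their losses differ, because the loss responds smoothly to how confident each prediction is.

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean binary cross-entropy; p_pred holds probabilities of class 1."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip to avoid log(0) for probabilities that are exactly 0 or 1.
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p)))

def accuracy(y_true, p_pred, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    hard = np.asarray(p_pred, dtype=float) > threshold
    return float(np.mean(hard == (np.asarray(y_true) == 1)))

# Hypothetical labels and predictions, chosen only for illustration.
labels = [1, 0]
confident = [0.9, 0.2]
hesitant = [0.6, 0.4]

loss_confident = binary_cross_entropy(labels, confident)
loss_hesitant = binary_cross_entropy(labels, hesitant)

print(accuracy(labels, confident), loss_confident)
print(accuracy(labels, hesitant), loss_hesitant)
```

Accuracy is identical for both sets, while the loss of the hesitant predictions is roughly three times larger.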

Should we consider this problem (i.e. better approximating the accuracy) solved and finished? At least in principle, no. I am old enough to remember an era when the only activation functions practically available were tanh and sigmoid; then came ReLU and gave a real boost to the field. Similarly, someone may eventually come up with a better loss function, but arguably this is going to happen in a research paper, and not as an answer to an SO question...

That said, the very fact that the current loss function comes from very elementary considerations of probability and information theory (fields that, in sharp contrast with the current field of deep learning, stand upon firm theoretical foundations) creates at least some doubt as to whether a better proposal for the loss may be just around the corner.

There is another subtle point on the relation between loss and accuracy, which makes the latter something qualitatively different from the former, and is frequently lost in such discussions. Let me elaborate a little...

All the classifiers related to this discussion (i.e. neural nets, logistic regression etc.) are probabilistic ones; that is, they do not return hard class memberships (0/1) but class probabilities (continuous real numbers in [0, 1]).

Limiting the discussion for simplicity to the binary case, when converting a class probability to a (hard) class membership, we are implicitly involving a threshold, usually equal to 0.5, such as: if p[i] > 0.5, then class[i] = "1". Now, we can find many cases where this naive default choice of threshold will not work (heavily imbalanced datasets are the first to come to mind), and we'll have to choose a different one. But the important point for our discussion here is that this threshold selection, while being of central importance to the accuracy, is completely external to the mathematical optimization problem of minimizing the loss, and serves as a further "insulation layer" between them, compromising the simplistic view that loss is just a proxy for accuracy (it is not). As nicely put in the answer of this Cross Validated thread:

the statistical component of your exercise ends when you output a probability for each class of your new sample. Choosing a threshold beyond which you classify a new observation as 1 vs. 0 is not part of the statistics any more. It is part of the decision component.
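
The role of the threshold can be sketched in a few lines of NumPy. The probabilities below are invented for an imbalanced toy set (eight negatives, two positives) and the helper names are hypothetical; the point is only that the predicted probabilities, and therefore the loss, stay fixed while the accuracy changes with the threshold chosen afterwards.

```python
import numpy as np

def accuracy_at(probs, labels, threshold):
    """Accuracy after turning probabilities into hard classes at a threshold."""
    hard = np.asarray(probs, dtype=float) > threshold
    return float(np.mean(hard == (np.asarray(labels) == 1)))

def best_threshold(probs, labels):
    """Scan midpoints between distinct probabilities and keep the most accurate."""
    values = np.unique(probs)                      # sorted distinct probabilities
    candidates = (values[:-1] + values[1:]) / 2.0  # midpoints avoid ties
    scores = [accuracy_at(probs, labels, t) for t in candidates]
    best = int(np.argmax(scores))
    return float(candidates[best]), scores[best]

# Made-up outputs of some probabilistic classifier on imbalanced data.
labels = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
probs = np.array([0.05, 0.10, 0.10, 0.15, 0.20, 0.20, 0.25, 0.30, 0.35, 0.45])

default_acc = accuracy_at(probs, labels, 0.5)
tuned_threshold, tuned_acc = best_threshold(probs, labels)

print('accuracy at 0.5: {}'.format(default_acc))
print('accuracy at {:.3f}: {}'.format(tuned_threshold, tuned_acc))
```

With the default threshold of 0.5 every sample is assigned class 0; moving the threshold between 0.30 and 0.35 separates the two classes, without touching the model or its loss. In practice the threshold would be chosen on held-out data rather than on the samples being scored.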


Enlarging somewhat an already broad discussion: Can we possibly move completely away from the (very) limiting constraint of mathematical optimization of continuous & differentiable functions? In other words, can we do away with back-propagation and gradient descent?

Well, we are actually doing so already, at least in the sub-field of reinforcement learning: 2017 was the year when new research from OpenAI on something called Evolution Strategies made headlines. And as an extra bonus, here is an ultra-fresh (Dec 2017) paper by Uber on the subject, again generating much enthusiasm in the community.
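
To give a flavour of gradient-free training, here is a toy sketch on the question's dataset. It is a simple (1+1)-style evolution strategy, essentially a random hill climb, and not the algorithm from the OpenAI or Uber work; the step size, iteration count, and seed are arbitrary choices of mine. Because no gradient is needed, the objective can be the classification accuracy itself.

```python
import numpy as np

# Toy dataset from the question.
X = np.array([0.0, 0.0, 2.0, 2.0, 9.0])
Y = np.array([0, 0, 1, 1, 0])

def accuracy(params):
    """Accuracy of the linear model w * x + b thresholded at 0.5."""
    w, b = params
    return float(np.mean((w * X + b > 0.5) == (Y == 1)))

rng = np.random.default_rng(0)   # arbitrary seed, for repeatability only
params = np.zeros(2)             # start from w = 0, b = 0
initial_acc = accuracy(params)
best_acc = initial_acc

for _ in range(500):
    # Mutate the single parent with Gaussian noise to get one offspring.
    candidate = params + rng.normal(scale=0.5, size=2)
    cand_acc = accuracy(candidate)
    # Keep the offspring if it is at least as accurate as the parent.
    if cand_acc >= best_acc:
        params, best_acc = candidate, cand_acc

print('initial accuracy: {}'.format(initial_acc))
print('final accuracy:   {}'.format(best_acc))
print('w = {:.3f}, b = {:.3f}'.format(params[0], params[1]))
```

On this dataset no linear threshold model can exceed 4/5, since the class-1 points at x = 2 lie between class-0 points at x = 0 and x = 9, so the search can at best match the hand-coded threshold. It will not scale like gradient descent, but it shows that a non-differentiable objective is not an obstacle in itself.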
