Choosing between different cost functions and activation functions for a neural network


Problem description

Recently I started toying with neural networks. I was trying to implement an AND gate with TensorFlow, and I am having trouble understanding when to use different cost and activation functions. This is a basic neural network with only input and output layers, no hidden layers.

First I tried to implement it in this way. As you can see this is a poor implementation, but I think it gets the job done, at least in some way. So, I tried only real-valued outputs, no one-hot outputs. For the activation function I used a sigmoid, and for the cost function I used the squared-error cost function (I think that's what it's called; correct me if I'm wrong).

I've tried using ReLU and softmax as activation functions (with the same cost function) and it doesn't work, and I figured out why they don't work. I also tried the sigmoid function with the cross-entropy cost function, and it also doesn't work.

import tensorflow as tf
import numpy

# AND-gate truth table with real-valued targets (no one-hot encoding).
train_X = numpy.asarray([[0,0],[0,1],[1,0],[1,1]])
train_Y = numpy.asarray([[0],[0],[0],[1]])

x = tf.placeholder("float", [None, 2])
y = tf.placeholder("float", [None, 1])

W = tf.Variable(tf.zeros([2, 1]))
b = tf.Variable(tf.zeros([1, 1]))

# Sigmoid output with a mean squared error cost.
activation = tf.nn.sigmoid(tf.matmul(x, W) + b)
cost = tf.reduce_sum(tf.square(activation - y)) / 4
optimizer = tf.train.GradientDescentOptimizer(.1).minimize(cost)

init = tf.initialize_all_variables()

with tf.Session() as sess:
    sess.run(init)
    for i in range(5000):
        sess.run(optimizer, feed_dict={x: train_X, y: train_Y})

    result = sess.run(activation, feed_dict={x: train_X})
    print(result)

After 5000 iterations:

[[ 0.0031316 ]
 [ 0.12012422]
 [ 0.12012422]
 [ 0.85576665]]

Question 1 - Is there any other activation function and cost function that can work (learn) for the above network, without changing the parameters (meaning without changing W, x, b)?

Question 2 - I read in a StackOverflow post here:

[Activation Function] selection depends on the problem.

So is there no cost function that can be used anywhere? I mean, there is no standard cost function that can be used on any neural network, right? Please correct me on this.

I also implemented the AND gate with a different approach, with the output as one-hot vectors. As you can see, train_Y [1,0] means that the 0th index is 1, so the answer is 0. I hope you get it.

Here I have used the softmax activation function, with cross entropy as the cost function. The sigmoid function as the activation function fails miserably here.

import tensorflow as tf
import numpy

# AND-gate truth table with one-hot targets: [1,0] = class 0, [0,1] = class 1.
train_X = numpy.asarray([[0,0],[0,1],[1,0],[1,1]])
train_Y = numpy.asarray([[1,0],[1,0],[1,0],[0,1]])

x = tf.placeholder("float", [None, 2])
y = tf.placeholder("float", [None, 2])

W = tf.Variable(tf.zeros([2, 2]))
b = tf.Variable(tf.zeros([2]))

# Softmax output with a cross-entropy cost.
activation = tf.nn.softmax(tf.matmul(x, W) + b)

cost = -tf.reduce_sum(y * tf.log(activation))

optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(cost)

init = tf.initialize_all_variables()

with tf.Session() as sess:
    sess.run(init)
    for i in range(5000):
        sess.run(optimizer, feed_dict={x: train_X, y: train_Y})

    result = sess.run(activation, feed_dict={x: train_X})
    print(result)

After 5000 iterations:

[[  1.00000000e+00   1.41971401e-09]
 [  9.98996437e-01   1.00352429e-03]
 [  9.98996437e-01   1.00352429e-03]
 [  1.40495342e-03   9.98595059e-01]]

Question 3 - So in this case, what cost function and activation function can I use? How do I know what type of cost and activation functions I should use? Is there a standard way or rule, or is it just experience? Do I have to try every cost and activation function in a brute-force manner? I found an answer here, but I am hoping for a more elaborate explanation.

Question 4 - I have noticed that it takes many iterations to converge to a near-accurate prediction. I think the convergence rate depends on the learning rate (too large a rate will miss the solution) and the cost function (correct me if I'm wrong). So, is there any optimal way (meaning the fastest), or an optimal cost function, for converging to a correct solution?

Answer

I will answer your questions a little bit out of order, starting with more general answers, and finishing with those specific to your particular experiment.

Activation functions. Different activation functions do, in fact, have different properties. Let's first consider an activation function between two layers of a neural network. The only purpose of an activation function there is to serve as a nonlinearity. If you do not put an activation function between two layers, then the two layers together will serve no better than one, because their combined effect is still just a linear transformation. For a long while people used the sigmoid function and tanh, chosen pretty much arbitrarily, with sigmoid being more popular, until recently, when ReLU became the dominant nonlinearity. The reason people use ReLU between layers is that it is non-saturating (and is also faster to compute). Think about the graph of the sigmoid function: if the absolute value of x is large, then the derivative of the sigmoid is small, which means that as we propagate the error backwards, the gradient of the error will vanish very quickly as we go back through the layers. With ReLU the derivative is 1 for all positive inputs, so the gradient for those neurons that fired is not changed by the activation unit at all and does not slow down gradient descent.
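
As a side illustration (not part of the original answer), a minimal NumPy sketch of the saturation argument: the sigmoid derivative collapses toward zero for inputs of large magnitude, while the ReLU derivative stays at 1 for every positive input.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])

# Sigmoid derivative s(x) * (1 - s(x)): largest at 0 (0.25), tiny at |x| = 10.
sigmoid_grad = sigmoid(x) * (1.0 - sigmoid(x))

# ReLU derivative: exactly 1 for every positive input, 0 otherwise.
relu_grad = (x > 0).astype(float)

print(sigmoid_grad)  # approximately [4.5e-05, 0.105, 0.25, 0.105, 4.5e-05]
print(relu_grad)     # [0. 0. 0. 1. 1.]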

For the last layer of the network, the activation unit also depends on the task. For regression you will want to use the sigmoid or tanh activation, because you want the result to be between 0 and 1. For classification you will want only one of your outputs to be one and all the others zero, but there is no differentiable way to achieve precisely that, so you will want to use a softmax to approximate it.
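
As a rough sketch of those two last-layer choices in the same TF 1.x style as the question's code (the names h, W_out, and b_out are made up for illustration, not taken from the question):

import tensorflow as tf

# Hypothetical output of the previous layer (4 units) and a 2-unit output layer.
h = tf.placeholder("float", [None, 4])
W_out = tf.Variable(tf.zeros([4, 2]))
b_out = tf.Variable(tf.zeros([2]))

# Regression-style output squashed into (0, 1): sigmoid on the last layer
# (tanh would give values in (-1, 1) instead).
regression_output = tf.nn.sigmoid(tf.matmul(h, W_out) + b_out)

# Classification: softmax turns the raw logits into probabilities that sum to 1,
# a differentiable approximation of a one-hot output.
logits = tf.matmul(h, W_out) + b_out
class_probs = tf.nn.softmax(logits)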

Your example. Now let's look at your example. Your first example tries to compute the output of AND in the following form:

sigmoid(W1 * x1 + W2 * x2 + B)

Note that W1 and W2 will always converge to the same value, because the output for (x1, x2) should be equal to the output of (x2, x1). Therefore, the model that you are fitting is:

sigmoid(W * (x1 + x2) + B)

x1 + x2 can only take one of three values (0, 1, or 2), and you want to return 0 when x1 + x2 < 2 and 1 when x1 + x2 = 2. Since the sigmoid function is rather smooth, it takes very large values of W and B to push the output close to the desired one, but because of the small learning rate they cannot reach those large values quickly. Increasing the learning rate in your first example will increase the speed of convergence.
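
For instance (an illustrative tweak, not something from the original answer), replacing the optimizer line of the first script with a larger step size should already speed up convergence noticeably; the value 1.0 is a guess that happens to work on this tiny problem:

# Same sigmoid + squared-error model as in the first example; only the step size changes.
optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(cost)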

Your second example converges better because the softmax function is good at making precisely one output equal to 1 and all the others 0. Since this is precisely your case, it does converge quickly. Note that sigmoid would also eventually converge to good values, but it would take significantly more iterations (or a higher learning rate).

What to use. Now to the last question: how do you choose which activation and cost functions to use? This advice will work for the majority of cases:

  1. If you do classification, use softmax for the last layer's nonlinearity and cross entropy as the cost function.

  2. If you do regression, use sigmoid or tanh for the last layer's nonlinearity and squared error as the cost function.

  3. Use ReLU as the nonlinearity between layers.

  4. Use better optimizers (AdamOptimizer, AdagradOptimizer) instead of GradientDescentOptimizer, or use momentum for faster convergence (a combined sketch follows this list).
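
Putting that advice together on the AND-gate data from the question, a minimal sketch in the same TF 1.x style might look like the following. The softmax + cross-entropy pair is folded into tf.nn.softmax_cross_entropy_with_logits for numerical stability, and AdamOptimizer replaces plain gradient descent; the learning rate (0.1) and iteration count (1000) are guesses for this tiny problem, not values from the original answer.

import tensorflow as tf
import numpy

# Same one-hot AND-gate data as in the second example.
train_X = numpy.asarray([[0,0],[0,1],[1,0],[1,1]])
train_Y = numpy.asarray([[1,0],[1,0],[1,0],[0,1]])

x = tf.placeholder("float", [None, 2])
y = tf.placeholder("float", [None, 2])

W = tf.Variable(tf.zeros([2, 2]))
b = tf.Variable(tf.zeros([2]))

# Keep the raw logits and let TensorFlow apply softmax + cross entropy in one
# numerically stable op (advice 1), then use Adam instead of plain gradient
# descent (advice 4).
logits = tf.matmul(x, W) + b
cost = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=logits))
optimizer = tf.train.AdamOptimizer(0.1).minimize(cost)

prediction = tf.nn.softmax(logits)

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    for i in range(1000):
        sess.run(optimizer, feed_dict={x: train_X, y: train_Y})
    print(sess.run(prediction, feed_dict={x: train_X}))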
