Loss in TensorFlow suddenly turns into nan

Problem description

When I use TensorFlow, the loss suddenly turns into nan, like this:

Epoch:  00001 || cost= 0.675003929
Epoch:  00002 || cost= 0.237375346
Epoch:  00003 || cost= 0.204962473
Epoch:  00004 || cost= 0.191322120
Epoch:  00005 || cost= 0.181427178
Epoch:  00006 || cost= 0.172107664
Epoch:  00007 || cost= 0.171604740
Epoch:  00008 || cost= 0.160334495
Epoch:  00009 || cost= 0.151639721
Epoch:  00010 || cost= 0.149983061
Epoch:  00011 || cost= 0.145890004
Epoch:  00012 || cost= 0.141182279
Epoch:  00013 || cost= 0.140914166
Epoch:  00014 || cost= 0.136189088
Epoch:  00015 || cost= 0.133215346
Epoch:  00016 || cost= 0.130046664
Epoch:  00017 || cost= 0.128267926
Epoch:  00018 || cost= 0.125328618
Epoch:  00019 || cost= 0.125053261
Epoch:  00020 || cost= nan
Epoch:  00021 || cost= nan
Epoch:  00022 || cost= nan
Epoch:  00023 || cost= nan
Epoch:  00024 || cost= nan
Epoch:  00025 || cost= nan
Epoch:  00026 || cost= nan
Epoch:  00027 || cost= nan

The main training code is:

for epoch in range(1000):
    Mcost = 0  # accumulated cost over this epoch

    for i in range(total_batch):
        # slice out the current mini-batch
        batch_X = X[i*batch_size:(i+1)*batch_size]
        batch_Y = Y[i*batch_size:(i+1)*batch_size]
        # one training step; keep_prob 0.8 enables dropout during training
        solver, c, pY = sess.run([train, cost, y_conv],
                                 feed_dict={x: batch_X, y_: batch_Y, keep_prob: 0.8})
        Mcost = Mcost + c

    print("Epoch: ", '%05d' % (epoch+1), "|| cost=", '{:.9f}'.format(Mcost/total_batch))

Since the cost is fine for the first 19 epochs, I believe the network and the input are OK. The network uses 4 convolutional layers with ReLU activations, and the last layer is fully connected with no activation function.

Also, I know that 0/0 or log(0) results in nan. But my loss function is:

c1 = y_conv - y_            # per-element prediction error
c2 = tf.square(c1)
c3 = tf.reduce_sum(c2, 1)   # sum of squared errors per sample
c4 = tf.sqrt(c3)            # Euclidean (L2) norm per sample
cost = tf.reduce_mean(c4)   # mean L2 error over the batch

I run TensorFlow on a GTX 1080 GPU.

Any suggestion is appreciated.

Recommended answer

Quite often, those NaNs come from a divergence in the optimization due to exploding gradients. They usually don't appear all at once, but rather after a phase where the loss increases suddenly and reaches inf within a few steps. The reason you do not see this explosive increase is probably that you check your loss only once per epoch -- display your loss every step or every few steps and you are likely to see this phenomenon.
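
For example, a minimal sketch of such per-step logging, reusing the training loop from the question (the log_every interval is an illustrative choice, not from the original code):

log_every = 10  # hypothetical logging interval

for epoch in range(1000):
    for i in range(total_batch):
        batch_X = X[i*batch_size:(i+1)*batch_size]
        batch_Y = Y[i*batch_size:(i+1)*batch_size]
        _, c = sess.run([train, cost],
                        feed_dict={x: batch_X, y_: batch_Y, keep_prob: 0.8})
        # print the raw batch cost every few steps, so a sudden
        # blow-up toward inf/nan is visible before it averages out
        if i % log_every == 0:
            print("Epoch %05d step %05d || batch cost = %.9f" % (epoch+1, i, c))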

As to why your gradient explodes suddenly, I would suggest you try without tf.sqrt in your loss function. This should be more numerically stable: the derivative of sqrt(u) is 1/(2*sqrt(u)), which tends to infinity as u approaches 0, so tf.sqrt has an exploding gradient near zero. This means an increasing risk of divergence once you get close to a solution -- which looks a lot like what you are observing.
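
A minimal sketch of two ways to apply this, assuming the same y_conv and y_ tensors as in the question (the epsilon value is an illustrative choice):

import tensorflow as tf

# sum of squared errors per sample, as in the original c3
squared_err = tf.reduce_sum(tf.square(y_conv - y_), 1)

# Option 1: drop the sqrt entirely -- mean squared error,
# whose gradient is well behaved everywhere.
cost = tf.reduce_mean(squared_err)

# Option 2: keep the Euclidean norm but guard the sqrt away from zero.
eps = 1e-8  # hypothetical small constant, not from the question
cost_l2 = tf.reduce_mean(tf.sqrt(squared_err + eps))

Option 1 removes the 1/(2*sqrt(u)) singularity entirely; option 2 keeps the norm but shifts the singularity out of reach, since squared_err + eps stays bounded away from zero.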
