Extremely small or NaN values appear in training neural network


Problem Description

I'm trying to implement a neural network architecture in Haskell, and use it on MNIST.

I'm using the hmatrix package for linear algebra. My training framework is built using the pipes package.

My code compiles and doesn't crash. But the problem is that certain combinations of layer size (say, 1000), minibatch size, and learning rate give rise to NaN values in the computations. After some inspection, I see that extremely small values (on the order of 1e-100) eventually appear in the activations. But even when that doesn't happen, the training still doesn't work: there's no improvement in loss or accuracy.
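
To surface these values early, one can scan the activations as they are produced. Here is a minimal sketch using hmatrix; hasBadValues is just an illustrative name, not a function from the repository:

import Numeric.LinearAlgebra (Matrix, flatten, toList)

-- Flag NaNs, infinities, and suspiciously tiny magnitudes in a matrix of activations.
hasBadValues :: Matrix Double -> Bool
hasBadValues m = any bad (toList (flatten m))
  where
    bad x = isNaN x || isInfinite x || (x /= 0 && abs x < 1e-30)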

I checked and rechecked my code, and I'm at a loss as to what the root of the problem could be.

Here's the backpropagation training, which computes the deltas for each layer:

backward lf n (out,tar) das = do
    let δout = tr (derivate lf (tar, out)) -- dE/dy
        deltas = scanr (\(l, a') δ ->
                         let w = weights l
                         in (tr a') * (w <> δ)) δout (zip (tail $ toList n) das)
    return (deltas)

lf is the loss function, n is the network (weight matrix and bias vector for each layer), out and tar are the actual output of the network and the target (desired) output, and das are the activation derivatives of each layer.

In batch mode, out and tar are matrices (rows are output vectors), and das is a list of matrices, one per layer.

Here's the actual gradient computation:

grad lf (n, (i,t)) = do
    -- Forward propagation: compute layers outputs and activation derivatives
    let (as, as') = unzip $ runLayers n i
        (out) = last as
    (ds) <- backward lf n (out, t) (init as') -- Compute deltas with backpropagation
    let r  = fromIntegral $ rows i -- Size of minibatch
    let gs = zipWith (\δ a -> tr (δ <> a)) ds (i:init as) -- Gradients for weights
    return $ GradBatch ((recip r .*) <$> gs, (recip r .*) <$> squeeze <$> ds)

Here, lf and n are the same as above, i is the input, and t is the target output (both in batch form, as matrices).

squeeze transforms a matrix into a vector by summing over each row. That is, ds is a list of matrices of deltas, where each column corresponds to the deltas for one row of the minibatch. So the gradients for the biases are the average of the deltas over the whole minibatch, and the same goes for gs, which corresponds to the gradients for the weights.
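
For reference, a squeeze of that shape can be written with hmatrix as a matrix-vector product with a vector of ones. This is only a sketch of the behaviour described above, not necessarily the definition used in the repository:

import Numeric.LinearAlgebra (Matrix, Vector, cols, konst, (#>))

-- Sum each row of the deltas matrix, collapsing the minibatch dimension.
squeezeSketch :: Matrix Double -> Vector Double
squeezeSketch m = m #> konst 1 (cols m)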

Here's the actual update code:

move lr (n, (i,t)) (GradBatch (gs, ds)) = do
    -- Update function
    let update = (\(FC w b af) g δ -> FC (w + (lr).*g) (b + (lr).*δ) af)
        n' = Network.fromList $ zipWith3 update (Network.toList n) gs ds
    return (n', (i,t))

lr is the learning rate. FC is the layer constructor, and af is the activation function for that layer.

The gradient descent algorithm makes sure to pass in a negative value for the learning rate. The actual code for the gradient descent is simply a loop around a composition of grad and move, with a parameterized stop condition.
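
For context, that loop is roughly of the following shape. This is a simplified sketch without the pipes plumbing, and descend and step are illustrative names rather than functions from the repository:

-- Simplified driver: repeatedly apply one grad-then-move step until a
-- caller-supplied stop condition holds. The real code streams minibatches
-- through pipes instead of threading a fixed state like this.
descend :: Monad m
        => (Int -> m Bool)   -- stop condition, given the iteration count
        -> (s -> m s)        -- one step: compute gradients, then move
        -> s -> m s
descend stop step = go 0
  where
    go k s = do
      done <- stop k
      if done
        then return s
        else step s >>= go (k + 1)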

Finally, here's the code for a mean square error loss function:

mse :: (Floating a) => LossFunction a a
mse = let f (y,y') = let gamma = y'-y in gamma**2 / 2
          f' (y,y') = (y'-y)
      in  Evaluator f f'

Evaluator just bundles a loss function and its derivative (for calculating the delta of the output layer).
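
For completeness, the types behind that snippet presumably look something like the following. This is a guess based on the usage above (Evaluator f f' in mse, and derivate lf (tar, out) in backward); the real definitions are in the repository:

-- A loss function bundled with its derivative; both take a (target, output)
-- pair, matching the call derivate lf (tar, out) in backward above.
data Evaluator a b = Evaluator
  { evaluate :: (a, b) -> b   -- the loss itself
  , derivate :: (a, b) -> b   -- its derivative with respect to the output
  }

type LossFunction a b = Evaluator a b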

The rest of the code is up on GitHub: NeuralNetwork.

So, if anyone has an insight into the problem, or even just a sanity check that I'm correctly implementing the algorithm, I'd be grateful.

Answer

Do you know about "vanishing" and "exploding" gradients in backpropagation? I'm not too familiar with Haskell so I can't easily see what exactly your backprop is doing, but it does look like you are using a logistic curve as your activation function.

If you look at the plot of this function you'll see that the gradient of this function is nearly 0 at the ends (as input values get very large or very small, the slope of the curve is almost flat), so multiplying or dividing by this during backpropagation will result in a very big or very small number. Doing this repeatedly as you pass through multiple layers causes the activations to approach zero or infinity. Since backprop updates your weights by doing this during training, you end up with a lot of zeros or infinities in your network.
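
To put numbers on that: the logistic function sigmoid x = 1 / (1 + exp (-x)) has derivative sigmoid x * (1 - sigmoid x), which never exceeds 0.25 and collapses toward zero as |x| grows, so a product of many such factors shrinks very quickly. A quick sketch:

-- Logistic activation and its derivative; the derivative peaks at 0.25 and
-- becomes vanishingly small for large |x|, which is what shrinks the
-- backpropagated deltas layer after layer.
sigmoid, sigmoid' :: Double -> Double
sigmoid x  = 1 / (1 + exp (negate x))
sigmoid' x = let s = sigmoid x in s * (1 - s)

-- map sigmoid' [0, 5, 10, 30]  is approximately  [0.25, 6.6e-3, 4.5e-5, 9.4e-14]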

Solution: there are loads of methods out there that you can search for to solve the vanishing gradient problem, but one easy thing to try is to change the type of activation function you are using to a non-saturating one. ReLU is a popular choice as it mitigates this particular problem (but might introduce others).
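
For example, ReLU and its derivative are just the following; how it plugs into your FC layers depends on how activation functions are represented in your code, so treat this as a sketch:

-- ReLU and its derivative: the gradient is exactly 1 for positive inputs,
-- so it does not shrink the backpropagated signal the way a saturated
-- logistic unit does (though units stuck at zero gradient are possible).
relu, relu' :: Double -> Double
relu  x = max 0 x
relu' x = if x > 0 then 1 else 0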
