Extremely small or NaN values appear in training neural network
Problem description
I'm trying to implement a neural network architecture in Haskell, and use it on MNIST.
I'm using the hmatrix package for linear algebra. My training framework is built using the pipes package.
My code compiles and doesn't crash. The problem is that certain combinations of layer size (say, 1000), minibatch size, and learning rate give rise to NaN values in the computations. After some inspection, I see that extremely small values (on the order of 1e-100) eventually appear in the activations. But even when that doesn't happen, the training still doesn't work: there's no improvement in loss or accuracy.
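One way to narrow down where such values first show up is to count them layer by layer after each forward pass. The following is only a minimal sketch: `Health` and `checkHealth` are hypothetical names, not part of the code in the question, and the activations are assumed to have been converted to a flat list of `Double` first (with hmatrix this could presumably be done with `toList . flatten`).

```haskell
-- Hypothetical debugging helper (not part of the original code base):
-- summarise suspicious values in one layer's activations.
data Health = Health
  { nanCount  :: Int  -- values that are NaN
  , infCount  :: Int  -- values that are +/- Infinity
  , tinyCount :: Int  -- non-zero values whose magnitude is below the threshold
  } deriving (Show, Eq)

-- checkHealth eps xs counts NaNs, infinities, and non-zero values
-- smaller in magnitude than eps in the list xs.
checkHealth :: Double -> [Double] -> Health
checkHealth eps xs = Health
  { nanCount  = length (filter isNaN xs)
  , infCount  = length (filter isInfinite xs)
  , tinyCount = length (filter tiny xs)
  }
  where
    tiny x = not (isNaN x) && x /= 0 && abs x < eps
```

Calling something like `checkHealth 1e-50` on every layer's output after each minibatch would show which layer degrades first, and after how many updates.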
I checked and rechecked my code, and I'm at a loss as to what the root of the problem could be.
Here's the backpropagation training, which computes the deltas for each layer:
    backward lf n (out,tar) das = do
        let δout = tr (derivate lf (tar, out)) -- dE/dy
            deltas = scanr (\(l, a') δ ->
                             let w = weights l
                             in (tr a') * (w <> δ)) δout (zip (tail $ toList n) das)
        return (deltas)
lf is the loss function, n is the network (weight matrix and bias vector for each layer), out and tar are the actual output of the network and the target (desired) output, and das are the activation derivatives of each layer.
In batch mode, out and tar are matrices (rows are output vectors), and das is a list of matrices.
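For comparison, the textbook backpropagation recursion is written out below. This is the standard formulation, not necessarily the layout used in the code: it assumes column vectors, pre-activations $z^{l} = W^{l} a^{l-1} + b^{l}$, activations $a^{l} = f(z^{l})$, a loss $E$, and $L$ layers, and all of these symbols are introduced here for illustration only.

```latex
\begin{aligned}
\delta^{L} &= \frac{\partial E}{\partial a^{L}} \odot f'(z^{L}) \\
\delta^{l} &= \left( (W^{l+1})^{\top} \delta^{l+1} \right) \odot f'(z^{l}),
  \qquad l = L-1, \dots, 1 \\
\frac{\partial E}{\partial W^{l}} &= \delta^{l} \, (a^{l-1})^{\top},
  \qquad
\frac{\partial E}{\partial b^{l}} = \delta^{l}
\end{aligned}
```

Two things may be worth checking against these equations. The backward function above sets δout to the bare loss derivative without multiplying by an activation derivative, which matches the first line only if the output layer is linear or the factor is applied elsewhere. It also multiplies by w rather than by its transpose, which is correct only under one particular storage orientation of the weight matrices. The excerpt alone does not settle either point.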
Here's the actual gradient computation:
    grad lf (n, (i,t)) = do
        -- Forward propagation: compute layers outputs and activation derivatives
        let (as, as') = unzip $ runLayers n i
            (out) = last as
        (ds) <- backward lf n (out, t) (init as') -- Compute deltas with backpropagation
        let r  = fromIntegral $ rows i -- Size of minibatch
        let gs = zipWith (\δ a -> tr (δ <> a)) ds (i:init as) -- Gradients for weights
        return $ GradBatch ((recip r .*) <$> gs, (recip r .*) <$> squeeze <$> ds)
Here, lf and n are the same as above, i is the input, and t is the target output (both in batch form, as matrices).
squeeze transforms a matrix into a vector by summing over each row. That is, ds is a list of matrices of deltas, where each column corresponds to the deltas for a row of the minibatch. So, the gradients for the biases are the average of the deltas over the whole minibatch. The same goes for gs, which corresponds to the gradients for the weights.
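The averaging described here can be spelled out with plain lists, which makes it easy to compare against the matrix version on a tiny batch. This is a sketch only: `Delta`, `biasGradient`, `outer`, and `weightGradient` are hypothetical names, and nested lists stand in for hmatrix matrices.

```haskell
-- Deltas of one layer for a single minibatch example.
type Delta = [Double]

-- Bias gradient: element-wise mean of the per-example deltas.
biasGradient :: [Delta] -> [Double]
biasGradient [] = []
biasGradient ds = map (/ r) (foldr1 (zipWith (+)) ds)
  where r = fromIntegral (length ds)

-- Outer product of a delta vector and an input activation vector.
outer :: [Double] -> [Double] -> [[Double]]
outer d a = [[di * aj | aj <- a] | di <- d]

-- Weight gradient: mean over the minibatch of the outer products of each
-- example's delta with that example's input activation.
weightGradient :: [(Delta, [Double])] -> [[Double]]
weightGradient [] = []
weightGradient pairs =
  map (map (/ r)) (foldr1 (zipWith (zipWith (+))) [outer d a | (d, a) <- pairs])
  where r = fromIntegral (length pairs)
```

Feeding the same two or three examples through both this version and grad, then comparing the numbers, is a cheap way to catch a transposition or a missing division by the batch size.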
Here's the actual update code:
    move lr (n, (i,t)) (GradBatch (gs, ds)) = do
        -- Update function
        let update = (\(FC w b af) g δ -> FC (w + (lr).*g) (b + (lr).*δ) af)
            n' = Network.fromList $ zipWith3 update (Network.toList n) gs ds
        return (n', (i,t))
lr is the learning rate. FC is the layer constructor, and af is the activation function for that layer.
The gradient descent algorithm makes sure to pass in a negative value for the learning rate. The actual code for the gradient descent is simply a loop around a composition of grad and move, with a parameterized stop condition.
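The sign convention matters here: because the update adds lr times the gradient, a positive lr climbs the loss instead of descending it. The sketch below shows this on a one-dimensional toy objective. The names `step`, `gradE`, and `descend` and the objective itself are made up for illustration and are not taken from the code in the question.

```haskell
-- One update with the sign convention from the text: the step is *added*,
-- so the learning rate must be negative for the loss to go down.
step :: Double -> Double -> Double -> Double
step lr g w = w + lr * g

-- Hypothetical 1-D objective E(w) = (w - 3)^2 / 2, whose gradient is w - 3.
gradE :: Double -> Double
gradE w = w - 3

-- Apply a fixed number of updates starting from w0.
descend :: Int -> Double -> Double -> Double
descend n lr w0 = iterate (\w -> step lr (gradE w) w) w0 !! n
```

With lr = -0.1 the iterate moves toward the minimum at w = 3, while lr = 0.1 moves away from it by a factor of 1.1 per step. If the real training loop ever passes a positive value, the weights grow without bound in the same way.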
Finally, here's the code for a mean square error loss function:
    mse :: (Floating a) => LossFunction a a
    mse = let f (y,y') = let gamma = y'-y in gamma**2 / 2
              f' (y,y') = (y'-y)
          in  Evaluator f f'
Evaluator just bundles a loss function and its derivative (for calculating the delta of the output layer).
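A quick sanity check for any loss/derivative pair is to compare the analytic derivative with a central finite difference. The sketch below copies the two functions from mse into stand-alone `Double` versions (`mseF`, `mseF'`, `numericD`, and `agrees` are hypothetical names) and assumes, following the call derivate lf (tar, out) in backward, that the first tuple component is the target and the second is the network output.

```haskell
-- Stand-alone copy of the loss: first component is the target y,
-- second is the network output y'.
mseF :: (Double, Double) -> Double
mseF (y, y') = let gamma = y' - y in gamma ** 2 / 2

-- Stand-alone copy of the derivative with respect to the output y'.
mseF' :: (Double, Double) -> Double
mseF' (y, y') = y' - y

-- Central finite difference of the loss with respect to the output y'.
numericD :: Double -> (Double, Double) -> Double
numericD h (y, y') = (mseF (y, y' + h) - mseF (y, y' - h)) / (2 * h)

-- True when the analytic and numeric derivatives agree within a small tolerance.
agrees :: (Double, Double) -> Bool
agrees p = abs (numericD 1e-5 p - mseF' p) < 1e-6
```

The same idea extends to the whole network: perturb one weight, recompute the loss, and compare the resulting slope with the corresponding entry of the gradient returned by grad.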
The rest of the code is up on GitHub: NeuralNetwork.
So, if anyone has an insight into the problem, or even just a sanity check that I'm correctly implementing the algorithm, I'd be grateful.
Recommended answer
Do you know about "vanishing" and "exploding" gradients in backpropagation? I'm not too familiar with Haskell so I can't easily see what exactly your backprop is doing, but it does look like you are using a logistic curve as your activation function.
If you look at the plot of this function you'll see that the gradient of this function is nearly 0 at the ends (as input values get very large or very small, the slope of the curve is almost flat), so multiplying or dividing by this during backpropagation will result in a very big or very small number. Doing this repeatedly as you pass through multiple layers causes the activations to approach zero or infinity. Since backprop updates your weights by doing this during training, you end up with a lot of zeros or infinities in your network.
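The shrinking effect can be seen numerically with a few lines of Haskell. This sketch assumes the standard logistic function and ignores the weight matrices entirely, so it isolates only the contribution of the activation derivatives. `sigmoid`, `sigmoid'`, and `chainFactor` are names chosen here, not taken from the question's code.

```haskell
-- Logistic (sigmoid) activation.
sigmoid :: Double -> Double
sigmoid x = 1 / (1 + exp (negate x))

-- Its derivative, s * (1 - s), which is largest at x = 0 and never exceeds 0.25.
sigmoid' :: Double -> Double
sigmoid' x = let s = sigmoid x in s * (1 - s)

-- Product of the activation derivatives along a chain of layers whose
-- pre-activations are given by the list (weights are ignored in this sketch).
chainFactor :: [Double] -> Double
chainFactor = product . map sigmoid'
```

Even in the best case, where every pre-activation is 0, ten layers scale the gradient by 0.25 to the tenth power, which is below 1e-6. With saturated units, for example a pre-activation of 10 in each of twenty layers, the factor drops below 1e-80, the same territory as the 1e-100 values reported in the question.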
Solution: there are loads of methods out there that you can search for to solve the vanishing gradient problem, but one easy thing to try is to change the type of activation function you are using to a non-saturating one. ReLU is a popular choice as it mitigates this particular problem (but might introduce others).
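For reference, ReLU and its derivative are short to write down. This is a stand-alone sketch on `Double`; how it would plug into the af field of the FC constructor depends on types not shown in the question, so that wiring is left out. The names `relu` and `relu'` are chosen here.

```haskell
-- Rectified linear unit: passes positive inputs through, clamps the rest to 0.
relu :: Double -> Double
relu = max 0

-- Derivative of ReLU: 1 for positive inputs, 0 otherwise
-- (the value at exactly 0 is a convention; 0 is used here).
relu' :: Double -> Double
relu' x = if x > 0 then 1 else 0
```

Because the derivative is exactly 1 for active units, a chain of them does not shrink the gradient the way a chain of logistic derivatives does. One of the other problems it can introduce is that a unit whose input stays negative gets a zero gradient and stops learning.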