Is the L1 regularization in Keras/Tensorflow *really* L1-regularization?


Problem Description

I am employing L1 regularization on my neural network parameters in Keras with keras.regularizers.l1(0.01) to obtain a sparse model. I am finding that, while many of my coefficients are close to zero, few of them are actually zero.
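
For context, here is a minimal sketch of the kind of setup being described (the architecture, data, and thresholds below are illustrative, not from the original question):

```python
import numpy as np
from tensorflow import keras

# Illustrative model: an L1 penalty (strength 0.01, as in the question)
# on the kernel of a Dense layer.
dense = keras.layers.Dense(64, activation="relu",
                           kernel_regularizer=keras.regularizers.l1(0.01))
model = keras.Sequential([
    keras.Input(shape=(20,)),
    dense,
    keras.layers.Dense(1),
])
model.compile(optimizer="sgd", loss="mse")
model.fit(np.random.rand(256, 20), np.random.rand(256, 1),
          epochs=5, verbose=0)

# Count weights that are *exactly* zero versus merely close to zero.
w = dense.get_weights()[0]
print("exact zeros:", np.sum(w == 0.0))
print("near zeros (|w| < 1e-3):", np.sum(np.abs(w) < 1e-3))
```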

Looking at the source code for the regularization, it appears that Keras simply adds the L1 norm of the parameters to the loss function.
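
In other words, the penalty is just the scaled L1 norm of the weights. A simplified sketch (not the literal file contents) of what keras.regularizers.l1(0.01) contributes to the loss:

```python
import tensorflow as tf

def l1_penalty(weights, l1=0.01):
    # What the regularizer adds to the total loss: the scaled L1 norm
    # of the weight tensor. There is no separate shrinkage/thresholding step.
    return l1 * tf.reduce_sum(tf.abs(weights))
```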

This would be incorrect because the parameters would almost certainly never go to zero (within floating point error) as intended with L1 regularization. The L1 norm is not differentiable when a parameter is zero, so subgradient methods need to be used where the parameters are set to zero if close enough to zero in the optimization routine. See the soft threshold operator max(0, ..) here.
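
For comparison, the soft-thresholding (proximal) step referred to above would look roughly like this; this is a sketch of the operator itself, not something Keras applies:

```python
import numpy as np

def soft_threshold(w, threshold):
    # Proximal operator of the L1 norm: shrink every weight towards zero
    # and set weights whose magnitude is below the threshold to exactly 0.
    return np.sign(w) * np.maximum(np.abs(w) - threshold, 0.0)

# With e.g. threshold = learning_rate * l1 = 0.01 * 0.01 = 1e-4:
print(soft_threshold(np.array([0.5, 5e-5, -5e-5, -0.3]), 1e-4))
# -> [ 0.4999  0.      0.     -0.2999]
```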

Does Tensorflow/Keras do this, or is this impractical to do with stochastic gradient descent?

Also here is a superb blog post explaining the soft thresholding operator for L1 regularization.

Answer

So despite @Joshua's answer, there are three other things worth mentioning:

  1. There is no problem with the gradient at 0: Keras automatically sets it to 1, similarly to the relu case.
  2. Remember that values smaller than 1e-6 are effectively equal to 0, as this is float32 precision.
  3. The problem of most values not being set exactly to 0 can arise for computational reasons, due to the nature of a gradient-descent-based algorithm (especially when a high l1 value is set): the gradient is discontinuous at 0, which can cause oscillations. To see why, imagine that for a given weight w = 0.005 your learning rate equals 0.01 and the gradient of the main loss w.r.t. w is 0. Your weight would then be updated in the following manner:

w = 0.005 - 1 * 0.01 = -0.005 (because the gradient is equal to 1 as w > 0),

and after the second update:

w = -0.005 + 1 * 0.01 = 0.005 (because the gradient is equal to -1 as w < 0).

As you can see, the absolute value of w has not decreased even though you applied l1 regularization, and this happened due to the nature of the gradient-based algorithm. Of course this is a simplified situation, but you can experience such oscillating behavior quite often when using an l1 norm regularizer.
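
A toy reproduction of that oscillation (plain SGD on the L1 subgradient alone, using the same hypothetical numbers as above):

```python
# Plain SGD on the L1 term only, with learning rate 0.01 and a
# main-loss gradient of 0 w.r.t. w (the scenario described above).
lr = 0.01
w = 0.005
for step in range(4):
    grad = 1.0 if w > 0 else -1.0  # subgradient of |w| away from 0
    w = w - lr * grad
    print(step, round(w, 6))
# w keeps jumping between -0.005 and 0.005 and never lands exactly on 0,
# whereas a soft-thresholding step (see above) would set it to 0 at once.
```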
