Is the L1 regularization in Keras/Tensorflow *really* L1-regularization?


Problem description


I am employing L1 regularization on my neural network parameters in Keras with keras.regularizers.l1(0.01) to obtain a sparse model. I am finding that, while many of my coefficients are close to zero, few of them are actually zero.
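For context, here is a minimal sketch of this kind of setup (the architecture, the random data, and the 1e-3 "near zero" threshold are assumptions for illustration, not the original model):

```python
import numpy as np
from tensorflow import keras

# Small Dense network with an L1 penalty on the kernel weights
# (architecture and data shapes are assumed for illustration).
model = keras.Sequential([
    keras.Input(shape=(100,)),
    keras.layers.Dense(64, activation="relu",
                       kernel_regularizer=keras.regularizers.l1(0.01)),
    keras.layers.Dense(1, kernel_regularizer=keras.regularizers.l1(0.01)),
])
model.compile(optimizer="sgd", loss="mse")

x = np.random.randn(256, 100).astype("float32")
y = np.random.randn(256, 1).astype("float32")
model.fit(x, y, epochs=10, verbose=0)

# Count weights that are exactly zero vs. merely close to zero.
w = model.layers[0].get_weights()[0]
print("exact zeros:", np.sum(w == 0.0))
print("near zeros (<1e-3):", np.sum(np.abs(w) < 1e-3))
```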


Looking at the source code for the regularization suggests that Keras simply adds the L1 norm of the parameters to the loss function.
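In other words, the regularizer contributes a penalty term that is differentiated like any other part of the loss. A rough sketch of the value it adds (the tensor `w` stands in for a layer's kernel and is an assumed example):

```python
import tensorflow as tf

# keras.regularizers.l1(0.01) adds roughly this to the total loss:
# 0.01 * sum(|w|), which is then differentiated by ordinary autodiff.
w = tf.constant([0.3, -0.004, 0.0, 1.2])
penalty = 0.01 * tf.reduce_sum(tf.abs(w))
print(float(penalty))  # 0.01 * (0.3 + 0.004 + 0.0 + 1.2) = 0.01504
```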


This would be incorrect because the parameters would almost certainly never go to exactly zero (within floating point error), as intended with L1 regularization. The L1 norm is not differentiable when a parameter is zero, so subgradient methods are needed, in which a parameter is set to exactly zero if it gets close enough to zero during the optimization routine. See the soft-thresholding operator max(0, ..) here.
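For comparison, here is a minimal NumPy sketch of the soft-thresholding (proximal) step described above; the threshold `lam` (playing the role of learning_rate * l1) and the example weights are assumed values:

```python
import numpy as np

def soft_threshold(w, lam):
    """Proximal operator of the L1 norm: shrinks every weight toward 0
    and sets any weight within `lam` of zero exactly to zero."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

w = np.array([0.3, -0.004, 0.0005, -1.2])
print(soft_threshold(w, lam=0.01))
# -> [ 0.29  -0.     0.    -1.19]  (small weights become exactly zero)
```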


Does Tensorflow/Keras do this, or is this impractical to do with stochastic gradient descent?


Also here is a superb blog post explaining the soft thresholding operator for L1 regularization.

Answer


So, despite @Joshua's answer, there are three other things that are worth mentioning:

  1. There is no problem connected with the gradient at 0: Keras automatically sets it to 1 there, similarly to the relu case.
  2. Remember that values smaller than 1e-6 are effectively equal to 0, as this is float32 precision.
  3. The fact that most of the values are not set exactly to 0 can arise for computational reasons, due to the nature of a gradient-descent based algorithm (especially with a high l1 value): because of the gradient discontinuity at 0, the weights can oscillate around zero. To see this, imagine that for a given weight w = 0.005 your learning rate is 0.01 and the gradient of the main loss w.r.t. w is 0. Your weight would then be updated in the following manner:

w = 0.005 - 1 * 0.01 = -0.005 (because the gradient is equal to 1, as w > 0),


and after the second update:

w = -0.005 + 1 * 0.01 = 0.005 (because the gradient is equal to -1, as w < 0).


As you can see, the absolute value of w has not decreased even though you applied l1 regularization, and this happens due to the nature of the gradient-based algorithm. Of course this is a simplified situation, but you can run into such oscillating behavior quite often when using an l1 norm regularizer.
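Here is a tiny numeric sketch of that oscillation, using the numbers above (learning rate 0.01, l1 coefficient 1, and a main-loss gradient assumed to be exactly 0, which is an idealized assumption rather than a real training run):

```python
# Plain SGD with an L1 subgradient: the weight jumps across zero
# instead of settling on it (data gradient assumed to be 0).
w, lr, l1 = 0.005, 0.01, 1.0
for step in range(4):
    grad = l1 * (1.0 if w > 0 else -1.0 if w < 0 else 0.0)
    w = w - lr * grad
    print(step, w)  # sign flips every step; |w| stays around 0.005
```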

