Is it reasonable for l1/l2 regularization to cause all feature weights to be zero in vowpal wabbit?


Question

I got a weird result from vw, which uses an online learning scheme for logistic regression. When I add --l1 or --l2 regularization, all my predictions come out at 0.5 (which means all feature weights are 0).

Here's my command:

vw -d training_data.txt --loss_function logistic -f model_l1 --invert_hash model_readable_l1 --l1 0.05 --link logistic

...and here's the learning process info:

using l1 regularization = 0.05
final_regressor = model_l1
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = training_data.txt
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
0.693147 0.693147            1            1.0  -1.0000   0.5000      120
0.423779 0.154411            2            2.0  -1.0000   0.1431      141
0.325755 0.227731            4            4.0  -1.0000   0.1584      139
0.422596 0.519438            8            8.0  -1.0000   0.4095      147
0.501649 0.580701           16           16.0  -1.0000   0.4638      139
0.509752 0.517856           32           32.0  -1.0000   0.4876      131
0.571194 0.632636           64           64.0   1.0000   0.2566      140
0.572743 0.574291          128          128.0  -1.0000   0.4292      139
0.597763 0.622783          256          256.0  -1.0000   0.4936      143
0.602377 0.606992          512          512.0   1.0000   0.4996      147
0.647667 0.692957         1024         1024.0  -1.0000   0.5000      119
0.670407 0.693147         2048         2048.0  -1.0000   0.5000      146
0.681777 0.693147         4096         4096.0  -1.0000   0.5000      115
0.687462 0.693147         8192         8192.0  -1.0000   0.5000      145
0.690305 0.693147        16384        16384.0  -1.0000   0.5000      145
0.691726 0.693147        32768        32768.0  -1.0000   0.5000      116
0.692437 0.693147        65536        65536.0  -1.0000   0.5000      117
0.692792 0.693147       131072       131072.0  -1.0000   0.5000      117
0.692970 0.693147       262144       262144.0  -1.0000   0.5000      147

BTW, the number of features is nearly 80,000, and each sample contains only a tiny fraction of them (that's why current features is only around 100).
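One quick way to confirm that the weights themselves have collapsed to zero (rather than the predictions merely hovering near 0.5) is to inspect the human-readable model written by --invert_hash. A minimal sketch, assuming the usual feature_name:hash:weight line format after the readable model's short header (the exact layout can differ between vw versions):

# Count non-zero weights in the readable model written by --invert_hash.
# Assumes weight lines have 3+ colon-separated fields with the weight last;
# adjust the field handling if your vw version formats it differently.
awk -F: 'NF >= 3 && $NF + 0 != 0 {n++} END {print n + 0, "non-zero weights"}' model_readable_l1

With --l1 0.05 this count should come out at, or very close to, zero.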

Here's my guess: in the objective/loss function, the second term, the regularization loss, might dominate the whole equation, which leads to this phenomenon?

loss = example_loss + regularization_loss
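Written out more explicitly (a sketch of the regularized objective for logistic loss, not necessarily vw's exact internal formulation; whether the L2 term carries a 1/2 factor is an assumption here):

example_loss        = \sum_i \log\left(1 + e^{-y_i \, w \cdot x_i}\right)
regularization_loss = \lambda_1 \|w\|_1 + \tfrac{1}{2} \lambda_2 \|w\|_2^2

where \lambda_1 and \lambda_2 are the values passed via --l1 and --l2. With \lambda_1 = 0.05 and the per-example gradients shrinking as training proceeds, the \lambda_1 \|w\|_1 term can easily dominate.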

And I tried another dataset (the other day's):

$vw-hypersearch -L 1e-10 5e-4 vw --l1 % training_data.txt 
vw-hypersearch: -L: using log-space search
trying 1.38099196677199e-06 ...................... 0.121092 (best)
trying 3.62058586892961e-08 ...................... 0.116472 (best)
trying 3.81427762457755e-09 ...................... 0.116095 (best)
trying 9.49219282204347e-10 ...................... 0.116084 (best)
trying 4.01833137620189e-10 ...................... 0.116083 (best)
trying 2.36222250814353e-10 ...................... 0.116083 (best)
loss(2.36222e-10) == loss(4.01833e-10): 0.116083
trying 3.08094024967111e-10 ...................... 0.116083 (best)
3.08094e-10 0.116083

Answer

As you correctly suspected: the regularization term dominates the loss calculation, leading to this result. This is because the regularization argument passed on the command line, --l1 0.05, is far too large.

Why does it work this way? vw applies the --l1 (and likewise the --l2) regularization value directly to the calculated sum of gradients, i.e. the value used is absolute rather than relative. After some convergence, the sum of gradients usually gets close to zero, so the regularization value dominates it. As the learning rate plateaus (too early, because the L1 value is so large), the learner can't extract more information from further examples.
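A schematic way to see why an absolute value of 0.05 is so damaging (a simplified sketch of an L1 truncation-style update, not vw's exact code; in particular, whether and how the learning rate scales the clip amount is glossed over here):

w_i \leftarrow \operatorname{sign}(u_i) \cdot \max\left(0,\; |u_i| - \lambda_1\right)

where u_i is the weight coordinate after the unregularized gradient step and \lambda_1 is the --l1 value. Once those unregularized updates shrink below \lambda_1 = 0.05, every coordinate gets clipped all the way back to zero, and the predictions sit at 0.5.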

Setting --l1 to a high value imposes a high floor on the convergence process.

As the vw-hypersearch result above shows, using a much smaller --l1 regularization term can improve the end result significantly:

+----------+----------------+
| l1 value | final avg loss |
+----------+----------------+
| 5e-02    |       0.692970 |
| 3.1e-10  |       0.116083 |
+----------+----------------+
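For example, re-running the original command with the value found by vw-hypersearch (illustrative only: keep whatever other options you need, and tune the value per dataset rather than reusing 3.1e-10 blindly):

vw -d training_data.txt --loss_function logistic --link logistic --l1 3.1e-10 -f model_l1 --invert_hash model_readable_l1

The same vw-hypersearch pattern should work for --l2 as well: the % placeholder simply marks where the candidate value is substituted into the command.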
