Is it reasonable for l1/l2 regularization to cause all feature weights to be zero in vowpal wabbit?
Question
I got a weird result from vw, which uses an online learning scheme for logistic regression. When I add --l1 or --l2 regularization, all my predictions come out as 0.5 (which means all feature weights are 0).
Here's my command:
vw -d training_data.txt --loss_function logistic -f model_l1 --invert_hash model_readable_l1 --l1 0.05 --link logistic
...and here's the learning process info:
using l1 regularization = 0.05
final_regressor = model_l1
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = training_data.txt
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.693147 0.693147 1 1.0 -1.0000 0.5000 120
0.423779 0.154411 2 2.0 -1.0000 0.1431 141
0.325755 0.227731 4 4.0 -1.0000 0.1584 139
0.422596 0.519438 8 8.0 -1.0000 0.4095 147
0.501649 0.580701 16 16.0 -1.0000 0.4638 139
0.509752 0.517856 32 32.0 -1.0000 0.4876 131
0.571194 0.632636 64 64.0 1.0000 0.2566 140
0.572743 0.574291 128 128.0 -1.0000 0.4292 139
0.597763 0.622783 256 256.0 -1.0000 0.4936 143
0.602377 0.606992 512 512.0 1.0000 0.4996 147
0.647667 0.692957 1024 1024.0 -1.0000 0.5000 119
0.670407 0.693147 2048 2048.0 -1.0000 0.5000 146
0.681777 0.693147 4096 4096.0 -1.0000 0.5000 115
0.687462 0.693147 8192 8192.0 -1.0000 0.5000 145
0.690305 0.693147 16384 16384.0 -1.0000 0.5000 145
0.691726 0.693147 32768 32768.0 -1.0000 0.5000 116
0.692437 0.693147 65536 65536.0 -1.0000 0.5000 117
0.692792 0.693147 131072 131072.0 -1.0000 0.5000 117
0.692970 0.693147 262144 262144.0 -1.0000 0.5000 147
BTW, the number of features is nearly 80,000 and each sample contains only a tiny fraction of them (that's why the current features column shows only around 100).
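A quick sanity check on that reading of the log, in plain Python (nothing VW-specific): a prediction of 0.5 is exactly what a model with all-zero weights produces, and the logistic loss of always predicting 0.5 is ln(2), which is the value the average loss plateaus at above.

import math

# With every weight at zero, w.x == 0 and the logistic link gives 0.5:
print(1.0 / (1.0 + math.exp(-0.0)))   # 0.5

# ...and the logistic loss of predicting 0.5 is ln(2),
# exactly the plateau in the progress log above:
print(math.log(2))                    # 0.6931471805599453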
Here's my guess: in the objective/loss function, the second term, the regularization loss, might dominate the whole equation and lead to this phenomenon:
loss = example_loss + regularization_loss
Then I tried another dataset (from another day) and used vw-hypersearch to tune --l1 on it:
$vw-hypersearch -L 1e-10 5e-4 vw --l1 % training_data.txt
vw-hypersearch: -L: using log-space search
trying 1.38099196677199e-06 ...................... 0.121092 (best)
trying 3.62058586892961e-08 ...................... 0.116472 (best)
trying 3.81427762457755e-09 ...................... 0.116095 (best)
trying 9.49219282204347e-10 ...................... 0.116084 (best)
trying 4.01833137620189e-10 ...................... 0.116083 (best)
trying 2.36222250814353e-10 ...................... 0.116083 (best)
loss(2.36222e-10) == loss(4.01833e-10): 0.116083
trying 3.08094024967111e-10 ...................... 0.116083 (best)
3.08094e-10 0.116083
Answer
As you correctly suspected, the regularization term dominates the loss calculation, leading to this result. This is because the regularization argument passed on the command line, --l1 0.05, is far too large.
Why does it work this way? vw applies the --l1 (and likewise the --l2) regularization value directly to the calculated sum of gradients, i.e. the value used is absolute rather than relative. After some convergence, the sum of gradients often gets close to zero, so the regularization value dominates it. As the learning rate plateaus (far too early here, because of the large L1 value), the learner can't extract any more information from further examples.
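A toy numeric comparison of what "absolute rather than relative" means here (the learning rate and gradient magnitude below are assumptions for illustration, not VW internals): the L1 pull on a weight scales with the --l1 value itself, regardless of how small the data-driven gradient has become, so 0.05 swamps a near-converged gradient while 3.08e-10 is invisible.

eta = 0.01              # assumed effective learning rate late in training
data_gradient = 1e-3    # assumed per-feature gradient magnitude near convergence

for l1 in (0.05, 3.08e-10):
    data_step = eta * data_gradient   # how far the example pushes the weight
    l1_pull = eta * l1                # how far the regularizer pulls it back toward 0
    print(f"l1={l1:g}: data step {data_step:.2e} vs L1 pull {l1_pull:.2e}")

# l1=0.05:     data step 1.00e-05 vs L1 pull 5.00e-04  -> the pull wins, weights stay at 0
# l1=3.08e-10: data step 1.00e-05 vs L1 pull 3.08e-12  -> the pull is negligible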
Setting --l1 to a high value imposes a high floor on the convergence process.
As the vw-hypersearch result above shows, using a much smaller --l1 regularization term improves the end result significantly:
+----------+----------------+
| l1 value | final avg loss |
+----------+----------------+
| 5.1e-02 | 0.692970 |
| 3.1e-10 | 0.116083 |
+----------+----------------+
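If you want to script the retrain with the tuned value rather than retype the command, here is a minimal sketch (assumptions: vw is on PATH, its end-of-run summary including the "average loss = ..." line goes to stderr, and the 3.08e-10 value found above on the other day's data is only a starting point for this dataset):

import re
import subprocess

# Same command as in the question, with --l1 0.05 replaced by the tuned value.
cmd = ["vw", "-d", "training_data.txt", "--loss_function", "logistic",
       "--link", "logistic", "--l1", "3.08e-10",
       "-f", "model_l1", "--invert_hash", "model_readable_l1"]

result = subprocess.run(cmd, capture_output=True, text=True, check=True)
match = re.search(r"average loss = ([\d.eE+-]+)", result.stderr)
print("average loss:", match.group(1) if match else "summary line not found")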