Why doesn't normalizing feature values change the training output much?

Problem description

I have 3113 training examples over a dense feature vector of size 78. The magnitudes of the features differ: some are around 20, some around 200K. For example, here is one of the training examples, in vowpal-wabbit input format.

0.050000 1 '2006-07-10_00:00:00_0.050000| F0:9.670000 F1:0.130000 F2:0.320000 F3:0.570000 F4:9.837000 F5:9.593000 F6:9.238150 F7:9.646667 F8:9.631333 F9:8.338904 F10:9.748000 F11:10.227667 F12:10.253667 F13:9.800000 F14:0.010000 F15:0.030000 F16:-0.270000 F17:10.015000 F18:9.726000 F19:9.367100 F20:9.800000 F21:9.792667 F22:8.457452 F23:9.972000 F24:10.394833 F25:10.412667 F26:9.600000 F27:0.090000 F28:0.230000 F29:0.370000 F30:9.733000 F31:9.413000 F32:9.095150 F33:9.586667 F34:9.466000 F35:8.216658 F36:9.682000 F37:10.048333 F38:10.072000 F39:9.780000 F40:0.020000 F41:-0.060000 F42:-0.560000 F43:9.898000 F44:9.537500 F45:9.213700 F46:9.740000 F47:9.628000 F48:8.327233 F49:9.924000 F50:10.216333 F51:10.226667 F52:127925000.000000 F53:-15198000.000000 F54:-72286000.000000 F55:-196161000.000000 F56:143342800.000000 F57:148948500.000000 F58:118894335.000000 F59:119027666.666667 F60:181170133.333333 F61:89209167.123288 F62:141400600.000000 F63:241658716.666667 F64:199031688.888889 F65:132549.000000 F66:-16597.000000 F67:-77416.000000 F68:-205999.000000 F69:144690.000000 F70:155022.850000 F71:122618.450000 F72:123340.666667 F73:187013.300000 F74:99751.769863 F75:144013.200000 F76:237918.433333 F77:195173.377778

The training result was not good, so I thought I would normalize the features to put them all on the same scale. I calculated the mean and standard deviation of each feature across all examples, then applied newValue = (oldValue - mean) / stddev, so that each feature's new mean is 0 and its stddev is 1. For the same example, here are the feature values after normalization:

0.050000 1 '2006-07-10_00:00:00_0.050000| F0:-0.660690 F1:0.226462 F2:0.383638 F3:0.398393 F4:-0.644898 F5:-0.670712 F6:-0.758233 F7:-0.663447 F8:-0.667865 F9:-0.960165 F10:-0.653406 F11:-0.610559 F12:-0.612965 F13:-0.659234 F14:0.027834 F15:0.038049 F16:-0.201668 F17:-0.638971 F18:-0.668556 F19:-0.754856 F20:-0.659535 F21:-0.663001 F22:-0.953793 F23:-0.642736 F24:-0.606725 F25:-0.609946 F26:-0.657141 F27:0.173106 F28:0.310076 F29:0.295814 F30:-0.644357 F31:-0.678860 F32:-0.764422 F33:-0.658869 F34:-0.674367 F35:-0.968679 F36:-0.649145 F37:-0.616868 F38:-0.619564 F39:-0.649498 F40:0.041261 F41:-0.066987 F42:-0.355693 F43:-0.638604 F44:-0.676379 F45:-0.761250 F46:-0.653962 F47:-0.668194 F48:-0.962591 F49:-0.635441 F50:-0.611600 F51:-0.615670 F52:-0.593324 F53:-0.030322 F54:-0.095290 F55:-0.139602 F56:-0.652741 F57:-0.675629 F58:-0.851058 F59:-0.642028 F60:-0.648002 F61:-0.952896 F62:-0.629172 F63:-0.592340 F64:-0.682273 F65:-0.470121 F66:-0.045396 F67:-0.128265 F68:-0.185295 F69:-0.510251 F70:-0.515335 F71:-0.687727 F72:-0.512749 F73:-0.471032 F74:-0.789335 F75:-0.491188 F76:-0.400105 F77:-0.505242
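The normalization was done as an offline preprocessing step. A minimal sketch of how it could be scripted is below, assuming a single default namespace with Name:value features exactly as in the examples above and the input.feat file name used in the commands further down; the script is illustrative and was not part of the original question:

awk '
NR == FNR {                                   # pass 1: accumulate per-feature sums
    p = index($0, "|")
    n = split(substr($0, p + 1), feats, " ")
    for (i = 1; i <= n; i++) {
        split(feats[i], kv, ":")
        sum[kv[1]]   += kv[2]
        sumsq[kv[1]] += kv[2] * kv[2]
        cnt[kv[1]]++
    }
    next
}
{                                             # pass 2: rewrite each value as a z-score
    p = index($0, "|")
    line = substr($0, 1, p)                   # keep label, importance and tag as-is
    n = split(substr($0, p + 1), feats, " ")
    for (i = 1; i <= n; i++) {
        split(feats[i], kv, ":")
        m = sum[kv[1]] / cnt[kv[1]]
        v = sumsq[kv[1]] / cnt[kv[1]] - m * m # population variance
        sd = (v > 0) ? sqrt(v) : 0
        line = line " " kv[1] ":" (sd > 0 ? (kv[2] - m) / sd : 0)
    }
    print line
}' input.feat input.feat > input.normalized.feat   # the file is read twice (two passes)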

However, this yields basically the same testing result (if not exactly the same, since I shuffle the examples before each training run).

I'm wondering why there is no change in the result.

Here are my training and testing commands:

rm -f cache
# train: 20 passes over the data, writing the model to "model"
cat input.feat | vw -f model --passes 20 --cache_file cache
# test: load the model, write predictions, and dump a readable model
cat input.feat | vw -i model -t -p predictions --invert_hash readable_model

(Yes, I'm testing on the training data for now, since I have very few examples to train on.)

More context:

Some of the features are "tier 2": they were derived by transforming or taking cross products of "tier 1" features (e.g. moving averages, 1st- to 3rd-order derivatives, etc.). If I normalize the tier 1 features before computing the tier 2 features, the model actually improves significantly.

So I'm puzzled as to why normalizing the tier 1 features (before generating the tier 2 features) helps a lot, while normalizing all features (after generating the tier 2 features) doesn't help at all?

BTW, since I'm training a regressor, I'm using SSE as the metric to judge the quality of the model.

Answer

vw normalizes feature values for scale as it goes, by default.

This is part of the online algorithm. It is done gradually during runtime.

In fact, it does more than that: vw's enhanced SGD algorithm also keeps a separate learning rate per feature, so the learning rates of rarer features don't decay as fast as those of common ones (--adaptive). Finally, there is an importance-aware update, controlled by a third option (--invariant).

The 3 separate SGD enhancement options (which are all turned on by default) are:

  • --adaptive
  • --invariant
  • --normalized

The last option is the one that adjusts values for scale (it discounts large values versus small ones). You may disable all these SGD enhancements by using the option --sgd. You may also partially enable any subset by specifying it explicitly.

All in all you have 2^3 = 8 SGD option combinations you can use.
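For example, reusing the training command from the question, classic SGD with all three enhancements disabled, and a run with only a chosen subset enabled, would look roughly like this (illustrative command lines, not part of the original answer):

rm -f cache
# plain SGD: --adaptive, --invariant and --normalized are all turned off
cat input.feat | vw --sgd -f model --passes 20 --cache_file cache

rm -f cache
# enable only the adaptive and normalized updates; --invariant stays off
cat input.feat | vw --adaptive --normalized -f model --passes 20 --cache_file cache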
