Is it good to normalize/standardize data having a large number of features with zeros?


Question

I have data with around 60 features, and in my training data most of them are zero most of the time; only 2-3 columns may have values (to be precise, it is perf log data). However, my test data will have some values in some other columns.

I applied normalization/standardization (tried both separately) and fed the result to PCA/SVD (tried both separately). I used these features to fit my model, but it gives very inaccurate results.

Whereas if I skip the normalization/standardization step and feed my data directly to PCA/SVD and then to the model, it gives accurate results (almost above 90% accuracy).

P.S.: I have to do anomaly detection, so I am using the Isolation Forest algorithm.

Why are these results varying?
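For reference, the two setups being compared can be sketched as sklearn pipelines. The data below is a hypothetical stand-in for the perf-log data described above (60 mostly-zero features with only a few active columns); the shapes, distributions, and `n_components` value are assumptions, not details from the question.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# hypothetical stand-in for the perf-log data: 60 features,
# almost all zero, with only the first 3 columns carrying values
X = np.zeros((500, 60))
X[:, :3] = rng.normal(5.0, 2.0, size=(500, 3))

# the two pipelines being compared in the question
with_scaling = make_pipeline(
    StandardScaler(), PCA(n_components=10), IsolationForest(random_state=0))
without_scaling = make_pipeline(
    PCA(n_components=10), IsolationForest(random_state=0))

with_scaling.fit(X)
without_scaling.fit(X)
print(with_scaling.predict(X[:5]))     # 1 = inlier, -1 = outlier
print(without_scaling.predict(X[:5]))
```

Wrapping both variants in pipelines like this makes it easy to swap the scaling step in and out and compare anomaly scores on the same held-out data.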

Answer

Normalization and standardization (depending on the source, the terms are sometimes used interchangeably, so I'm not sure what you mean exactly by each one here, but it's not important) are a general recommendation that usually works well in problems where the data is more or less homogeneously distributed. Anomaly detection, however, is by definition not that kind of problem. If you have a data set where most of the examples belong to class A and only a few belong to class B, it is possible (if not likely) that sparse features (features that are almost always zero) are actually very discriminative for your problem. Normalizing them will basically turn them into zero or almost zero, making it hard for a classifier (or PCA/SVD) to actually grasp their importance. So it is not unreasonable that you get better accuracy when you skip the normalization, and you shouldn't feel you are doing it "wrong" just because you are "supposed to do it".
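The effect on PCA can be illustrated with a small synthetic example (the feature values and proportions here are made up for illustration): a sparse, high-magnitude feature dominates raw variance, so unscaled PCA keeps it, while standardization flattens every column to unit variance and erases that signal.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000
# hypothetical sparse feature: 0 in ~90% of rows, 50 otherwise
sparse = np.where(rng.random(n) < 0.1, 50.0, 0.0)
noise = rng.normal(0.0, 1.0, size=(n, 5))   # five uninformative features
X = np.column_stack([sparse, noise])

# without scaling, PCA is driven by raw variance: the sparse
# column dominates the first component
pca_raw = PCA(n_components=1).fit(X)
print(pca_raw.explained_variance_ratio_)    # close to 1

# after standardization every column has unit variance, so the
# sparse column loses its privileged position
pca_std = PCA(n_components=1).fit(StandardScaler().fit_transform(X))
print(pca_std.explained_variance_ratio_)    # close to 1/6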

I don't have experience with anomaly detection, but I have some with unbalanced data sets. You could consider some form of "weighted normalization", where the mean and variance of each feature are computed with weights inversely proportional to the number of examples in the class (e.g. examples_A ^ alpha / (examples_A ^ alpha + examples_B ^ alpha), with alpha some small negative number). If your sparse features have very different scales (e.g. one is 0 in 90% of cases and 3 in 10% of cases, while another is 0 in 90% of cases and 80 in 10% of cases), you could simply scale them to a common range (e.g. [0, 1]).
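The common-range suggestion at the end of that paragraph can be sketched with min-max scaling, using the same made-up 0/3 and 0/80 features as the example. Note that because the column minimum is zero, zeros map to zero, so the sparsity pattern is preserved while the scales become comparable.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# two hypothetical sparse features on very different scales:
# one is 0 in 90% of rows and 3 otherwise, the other 0 or 80
f1 = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 3], dtype=float)
f2 = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 80], dtype=float)
X = np.column_stack([f1, f2])

# scale each column to [0, 1]: zeros stay zero and the rare
# non-zero values map to 1, so sparsity is preserved
X_scaled = MinMaxScaler().fit_transform(X)
print(X_scaled[-1])  # [1. 1.]
```

Unlike standardization, this keeps the rare non-zero entries clearly distinguishable from the zeros instead of recentering everything around the mean.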

In any case, as I said, do not apply techniques just because they are supposed to work. If something doesn't work for your problem or particular data set, you are right not to use it (and trying to understand why it doesn't work may yield some useful insights).
