Right order of doing feature selection, PCA and normalization?


Question

I know that feature selection helps me remove features that may have low contribution. I know that PCA helps reduce possibly correlated features into one, reducing the dimensions. I know that normalization transforms features to the same scale.

But is there a recommended order to do these three steps? Logically I would think that I should weed out bad features by feature selection first, followed by normalizing them, and finally use PCA to reduce dimensions and make the features as independent from each other as possible.

Is this logic correct?

Bonus question - are there any more things to do (preprocess or transform) to the features before feeding them into the estimator?

Answer

If I were doing a classifier of some sort, I would personally use this order:

  1. Normalization
  2. PCA
  3. Feature selection
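The ordering above can be sketched with scikit-learn (assuming it is available; the data below is synthetic and the stage names are illustrative). Steps 1 and 2 map directly onto pipeline stages, while step 3 corresponds to choosing `n_components`:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic data: 4 features on wildly different scales
X = rng.normal(size=(100, 4)) * [1000.0, 1.0, 5.0, 0.1]

pipe = Pipeline([
    ("normalize", StandardScaler()),  # step 1: zero mean, unit variance
    ("pca", PCA(n_components=2)),     # steps 2-3: rotate, keep top 2 components
])
X_reduced = pipe.fit_transform(X)
print(X_reduced.shape)  # (100, 2)
```

Wrapping the steps in a `Pipeline` also guarantees that, when cross-validating, the scaler and PCA are fit only on the training fold.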

Normalization: You would do normalization first to get the data into reasonable bounds. If you have data (x, y) where x ranges from -1000 to +1000 and y ranges from -1 to +1, any distance metric will automatically treat a change in y as less significant than a change in x. We don't know that is the case yet, so we want to normalize our data.
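A minimal sketch of this step, assuming z-score normalization on made-up (x, y) data at those two scales:

```python
import numpy as np

# Hypothetical (x, y) data on very different scales
x = np.array([-1000.0, -500.0, 0.0, 500.0, 1000.0])
y = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])

# Z-score normalization: subtract the mean, divide by the standard deviation
x_n = (x - x.mean()) / x.std()
y_n = (y - y.mean()) / y.std()
print(x_n)  # identical to y_n: both series now have mean 0 and std 1
```

After normalization, a unit step in either variable contributes equally to any distance metric.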

PCA: Uses the eigenvalue decomposition of the data's covariance matrix to find an orthogonal basis set that describes the variance in the data points. If you have 4 features, PCA might show you that only 2 of them really differentiate the data points, which brings us to the last step.
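As a rough illustration (synthetic data; PCA done by eigendecomposition of the covariance matrix rather than a library call), here two of the four features are near-copies of the other two, so only two directions carry real variance:

```python
import numpy as np

rng = np.random.default_rng(1)
base = rng.normal(size=(200, 2))
# 4 features, but columns 2 and 3 are near-duplicates of columns 0 and 1
X = np.column_stack([base[:, 0], base[:, 1],
                     base[:, 0] + 1e-3 * rng.normal(size=200),
                     base[:, 1] + 1e-3 * rng.normal(size=200)])
X = X - X.mean(axis=0)

# Eigenvalues of the covariance matrix, sorted in descending order
eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
explained = eigvals / eigvals.sum()
print(explained)  # the top two components carry essentially all of the variance
```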

Feature Selection: Once you have a coordinate space that better describes your data, you can select which features are salient. Typically you'd use the largest eigenvalues (EVs) and their corresponding eigenvectors from PCA for your representation. Since larger EVs mean there is more variance in that data direction, you get more granularity in isolating features. This is a good method for reducing the dimensionality of your problem.
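A minimal sketch of that selection step, assuming synthetic correlated data and a 95% explained-variance cutoff (the threshold is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 4)) @ rng.normal(size=(4, 4))  # correlated features
X = (X - X.mean(axis=0)) / X.std(axis=0)                 # normalize first

# PCA: eigendecomposition of the covariance matrix, largest EVs first
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Feature selection: keep the fewest components explaining >= 95% of variance
ratio = np.cumsum(eigvals) / eigvals.sum()
k = int(np.searchsorted(ratio, 0.95)) + 1
X_reduced = X @ eigvecs[:, :k]
print(k, X_reduced.shape)
```

The projected columns of `X_reduced` are the new, decorrelated features you would feed to the estimator.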

Of course this could change from problem to problem, but it is simply a generic guide.

