Is it okay to normalize by row after running a PCA?

Question

I have a dataset of 50K rows and 26 features. I'm normalizing the columns using sklearn's StandardScaler (each column has 0 mean and 1 standard deviation), then running a PCA to reduce the feature set while retaining ~90% of the original variance. I then normalize the rows before running sklearn's KMeans algorithm.

Is there any reason I shouldn't be normalizing the rows after running a PCA? If there is, would normalizing the rows before the PCA cause any issues - should this be done before or after normalizing the columns?

The reason for normalizing the rows is to remove the 'magnitude' or 'skill level' from each row, and instead, look at the relationship between the respective PCA-reduced features.
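
For concreteness, the pipeline described above could be sketched roughly as follows. This is only an illustration under assumptions: the data is a synthetic stand-in, "normalizing the rows" is taken to mean scaling each row to unit L2 length with sklearn's `Normalizer`, and `n_clusters=5` is an arbitrary placeholder.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer, StandardScaler

# Synthetic stand-in for the real 50K x 26 dataset (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 26))

pipeline = make_pipeline(
    StandardScaler(),                           # columns: zero mean, unit variance
    PCA(n_components=0.90, svd_solver="full"),  # keep enough components for ~90% of the variance
    Normalizer(norm="l2"),                      # rows: scale each sample to unit length (assumed norm)
    KMeans(n_clusters=5, n_init=10, random_state=0),  # cluster count is a placeholder
)

# One cluster label per input row.
labels = pipeline.fit_predict(X)
```

With unit-length rows, KMeans effectively groups samples by the direction of their PCA scores rather than by their overall magnitude.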

Answer

This is very dependent on the data. Since I don't know how these "skill level" numbers affect the shape of the data, I'm hesitant to give a direct answer. For instance, is it reasonable to have some rows with several normalized scores outside the [-1, 1] range, while others have only values of small magnitude? It sounds like this is the case you're trying to address.

I worry that you'll have a lot of rows with several values in the 1-2 range (either + or -), but also some rows with perhaps a single +1 value and the rest of the items near 0. When you normalize such a "one-hot" row, that one value is stretched until it dominates the row, however small the row's original magnitude was. Do you want it clustered as an outlier, or included in the central region of the space? Is someone with a single more-than-mediocre trait an outlier for this data?
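
A small numeric sketch of that concern, assuming row normalization means unit L2 length (the question does not say which norm is used). The three rows are made-up values, not taken from any real data.

```python
import numpy as np
from sklearn.preprocessing import normalize

rows = np.array([
    [1.5, -1.2, 1.8, -1.1],       # several components with magnitude 1-2
    [1.0, 0.05, -0.02, 0.03],     # a single +1 value, the rest near 0
    [0.1, 0.005, -0.002, 0.003],  # same direction as the row above, ten times smaller
])

# Scale every row to unit L2 length.
unit_rows = normalize(rows, norm="l2")
print(np.round(unit_rows, 3))
```

After scaling, the second and third rows become identical, because only their direction survives, and their first component sits close to 1. The first row spreads its length over four components, so none of them gets that large.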

There's nothing wrong with re-normalizing after a PCA. However, if you normalize both before and after, you won't see much change, since the PCA kept the large majority of the variance and removed only the components that seem redundant.
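
One way to check how much the row normalization step matters on your own data is to cluster the PCA output with and without it and compare the two labelings. The sketch below does that with the adjusted Rand index. The data is again a synthetic placeholder, and the helper `cluster` and the choice of `k=5` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler, normalize


def cluster(Z, k=5):
    """Hypothetical helper: KMeans labels for the rows of Z."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Z)


# Synthetic stand-in for the real data (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 26))

# Column-standardize, then keep enough components for ~90% of the variance.
Z = PCA(n_components=0.90, svd_solver="full").fit_transform(
    StandardScaler().fit_transform(X)
)

labels_raw = cluster(Z)              # no row normalization
labels_unit = cluster(normalize(Z))  # rows scaled to unit L2 length first

# 1.0 means identical partitions; values near 0 mean essentially unrelated ones.
ari = adjusted_rand_score(labels_raw, labels_unit)
print(f"adjusted Rand index: {ari:.3f}")
```

A score close to 1 suggests the row magnitudes were not driving the clusters. A low score suggests the normalization changes what KMeans is grouping on, which is worth inspecting before committing to either variant.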
