How to select top 100 features (a subset) which are most relevant after PCA?


Problem description


I performed PCA on a 63*2308 matrix and obtained a score matrix and a coefficient matrix. The score matrix is 63*2308 and the coefficient matrix is 2308*2308 in dimensions.

How do I extract the column names of the top 100 most important features, so that I can perform regression on them?

Solution

PCA should give you both a set of eigenvectors (your coefficient matrix) and a vector of eigenvalues (1*2308), often referred to as lambda. You might need to use a different PCA function in MATLAB to get them.

The eigenvalues indicate how much of the variance in your data each eigenvector explains. A simple method for selecting features is to take the 100 with the highest eigenvalues. This gives you a set of features that explain most of the variance in the data.

If you need to justify your approach for a write-up, you can actually calculate the amount of variance explained per eigenvector and cut off at, for example, 95% of the variance explained.
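For example, that cutoff can be computed directly from the eigenvalues. A minimal sketch in Python/NumPy (an equivalent of the MATLAB workflow; a small random matrix stands in for the real 63*2308 data, and the 95% threshold is just an example):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((63, 200))  # stand-in for the real data matrix

# Center the data; the squared singular values of the centered matrix,
# divided by (n - 1), are the eigenvalues of the covariance matrix.
Xc = X - X.mean(axis=0)
singular_values = np.linalg.svd(Xc, compute_uv=False)
eigenvalues = singular_values ** 2 / (X.shape[0] - 1)

# Cumulative fraction of variance explained; keep enough components
# to reach 95%.
explained = np.cumsum(eigenvalues) / eigenvalues.sum()
n_components = int(np.searchsorted(explained, 0.95) + 1)
```

In MATLAB the same numbers come straight out of the `latent` output, so only the cumulative sum and threshold step is needed.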

Bear in mind that selecting based solely on eigenvalue might not correspond to the set of features most important to your regression, so if you don't get the performance you expect you might want to try a different feature selection method, such as recursive feature elimination. I would suggest using Google Scholar to find a couple of papers doing something similar and seeing what methods they use.
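As one concrete alternative, recursive feature elimination repeatedly fits a model and drops the weakest features until a target count remains. A sketch using scikit-learn's `RFE` with a linear model on synthetic data (the shapes, estimator, and coefficients here are illustrative assumptions, not the original setup):

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((63, 50))  # stand-in for the real data
# Target depends strongly on features 0, 1, and 2 only.
y = 3.0 * X[:, 0] + X[:, 1] - 2.0 * X[:, 2] + 0.1 * rng.standard_normal(63)

# Keep the 10 features the linear model finds most useful,
# eliminating 5 features per iteration.
selector = RFE(LinearRegression(), n_features_to_select=10, step=5)
selector.fit(X, y)
selected_columns = np.flatnonzero(selector.support_)
```

Unlike the eigenvalue ranking above, this selects original columns (so their names carry over directly), at the cost of fitting the model many times.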


A quick MATLAB example of taking the top 100 principal components using PCA:

% princomp returns eigenvectors (coefficients), scores, and eigenvalues
% (in newer MATLAB releases, use pca instead)
[eigenvectors, projected_data, eigenvalues] = princomp(X);
[~, feature_idx] = sort(eigenvalues, 'descend');  % already descending; sorted for clarity
selected_projected_data = projected_data(:, feature_idx(1:100));
