Selection of features using PCA


Question

I am doing unsupervised classification. For this I have 8 features per image (variance of green, std. dev. of green, mean of red, variance of red, std. dev. of red, mean of hue, variance of hue, std. dev. of hue), and I want to select the 3 most significant features using PCA. I have written the following code for feature selection (the feature matrix is 179×8):

mu = mean(feature);                  % column means, computed once
for c=1:size(feature,1)
   feature(c,:) = feature(c,:) - mu; % center each sample
end

DataCov=cov(feature); % covariance matrix
[PC,variance,explained] = pcacov(DataCov)

This gives me:

PC =

0.0038   -0.0114    0.0517    0.0593    0.0039    0.3998    0.9085   -0.0922
0.0755   -0.1275    0.6339    0.6824   -0.3241   -0.0377   -0.0641    0.0052
0.7008    0.7113   -0.0040    0.0496   -0.0207    0.0042    0.0012    0.0002
0.0007   -0.0012    0.0051    0.0101    0.0272    0.0288    0.0873    0.9953
0.0320   -0.0236    0.1521    0.2947    0.9416   -0.0142   -0.0289   -0.0266
0.7065   -0.6907   -0.1282   -0.0851    0.0060    0.0003    0.0010   -0.0001
0.0026   -0.0037    0.0632   -0.0446    0.0053    0.9125   -0.4015    0.0088
0.0543   -0.0006    0.7429   -0.6574    0.0838   -0.0705    0.0311   -0.0001

variance =

0.0179
0.0008
0.0001
0.0000
0.0000
0.0000
0.0000
0.0000

explained =

94.9471
4.1346
0.6616
0.2358
0.0204
0.0003
0.0002
0.0000

This means the first principal component explains 94.9% of the variance, and so on... but these are ordered from most to least significant. How can I know which features (from 1 to 8) to select based on the above information?
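For readers without MATLAB, the covariance-route PCA from the question can be sketched in NumPy; this is only an illustrative sketch, and the random `X` below stands in for the real 179×8 feature matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(179, 8))      # stand-in for the real feature matrix

Xc = X - X.mean(axis=0)            # center each feature (column)
C = np.cov(Xc, rowvar=False)       # 8x8 covariance matrix
eigvals, PC = np.linalg.eigh(C)    # eigenvalues come back in ascending order
order = np.argsort(eigvals)[::-1]  # reorder descending, as pcacov does
eigvals, PC = eigvals[order], PC[:, order]
explained = 100 * eigvals / eigvals.sum()  # percent variance per component
```

Note that `explained` describes the *components*, not the original features, which is exactly why the question arises: an extra step is needed to map component importance back to feature importance.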

Answer

Your problem is the same as the COLUMNSELECT problem discussed by Mahoney and Drineas in "CUR matrix decompositions for improved data analysis".

They first compute the leverage scores for each dimension and then select 3 of them randomly, using the leverage scores as weights. Alternatively, you can select the largest ones. Here is the script for your problem:

I first got a real nature image from the web and resized it to the dimensions you asked for. The image is as follows:

%# Example data from real image of size 179x8
%# You can skip it for your own data
features = im2double(rgb2gray(imread('img.png')));

%# m samples, n dimensions
[m,n] = size(features);

Then, compute the centered data:

%# Remove the mean
features = features - repmat(mean(features,2), 1, size(features,2));

I use SVD to compute PCA since it gives you both the principal components and the coefficients. If the samples are in columns, then U holds the principal components. Check the second page of this paper for the relationship.

%# Compute the SVD
[U,S,V] = svd(features);

The key idea here is that we want to keep the dimensions that carry most of the variation, under the assumption that there is some noise in the data. We select only the dominant eigenvectors, e.g. those representing 95% of the variance.

%# Compute the number of eigenvectors representing
%#  95% of the variation (variance = squared singular values)
coverage = cumsum(diag(S).^2);
coverage = coverage ./ max(coverage);
[~, nEig] = max(coverage > 0.95);

The leverage scores are then computed from the first nEig principal components. That is, for each dimension we take the squared norm of its first nEig coefficients.

%# Compute the norms of each vector in the new space
norms = zeros(n,1);
for i = 1:n
    norms(i) = norm(V(i,1:nEig))^2;
end

Then, we can sort the leverage scores:

%# Get the largest 3
[~, idx] = sort(norms, 'descend');
idx(1:3)'

and get the indices of the vectors with the largest leverage scores:

ans =
   6     8     5

You can check the paper for more details.
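For cross-checking, the same leverage-score pipeline can be sketched in NumPy; random data stands in for the real features, and the 95% threshold matches the script above:

```python
import numpy as np

rng = np.random.default_rng(1)
features = rng.normal(size=(179, 8))   # stand-in for the real 179x8 data

# Remove the mean (per sample, as in the MATLAB snippet above)
features = features - features.mean(axis=1, keepdims=True)

# SVD: rows of Vt are the right singular vectors
U, s, Vt = np.linalg.svd(features, full_matrices=False)

# Number of components covering 95% of the variance (singular values squared)
coverage = np.cumsum(s**2) / np.sum(s**2)
nEig = int(np.argmax(coverage > 0.95)) + 1

# Leverage score of each dimension: squared norm of its first nEig coefficients
V = Vt.T
norms = np.sum(V[:, :nEig]**2, axis=1)

# Indices (0-based, unlike MATLAB) of the 3 largest leverage scores
idx = np.argsort(norms)[::-1][:3]
```

Because V is orthogonal, each leverage score lies between 0 and 1, and they sum to nEig.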

But keep in mind that PCA-based techniques are good when you have very many dimensions. In your case the search space is very small, so my advice is to search it exhaustively and pick the best selection, as @amit recommends.
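Since choosing 3 of 8 features leaves only C(8,3) = 56 candidate subsets, the exhaustive search can be sketched as follows; the scoring function here (total variance of the chosen columns) is a placeholder assumption and should be replaced by whatever criterion the actual clustering uses:

```python
from itertools import combinations

import numpy as np

rng = np.random.default_rng(2)
features = rng.normal(size=(179, 8))       # stand-in for the real data

def score(subset, X):
    # Placeholder criterion; swap in e.g. a clustering-quality measure
    # such as silhouette score for the real task.
    return X[:, list(subset)].var(axis=0).sum()

subsets = list(combinations(range(8), 3))  # all 56 candidate subsets
best = max(subsets, key=lambda s: score(s, features))
```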
