Matlab - PCA analysis and reconstruction of multidimensional data

Question

I have a large dataset of multidimensional data (132 dimensions).

I am a beginner at data mining and I want to apply Principal Component Analysis using Matlab. However, I have seen that there are a lot of functions explained on the web, but I do not understand how they should be applied.

Basically, I want to apply PCA and obtain the eigenvectors and their corresponding eigenvalues from my data.

After this step I want to be able to reconstruct my data based on a selection of the obtained eigenvectors.

I can do this manually, but I was wondering whether there are any predefined functions that can do this, since they should already be optimized.

My initial data is something like: size(x) = [33800 132]. So basically I have 132 features (dimensions) and 33800 data points, and I want to perform PCA on this data set.

Any help or hint would do.

Answer

Here's a quick walkthrough. First we create a matrix of your hidden variables (or "factors"). It has 100 observations and there are two independent factors.

>> factors = randn(100, 2);

Now create a loadings matrix. This is going to map the hidden factors onto your observed variables. Say your observed variables have four features; then your loadings matrix needs to be 4 x 2:

>> loadings = [
      1   0
      0   1
      1   1
      1  -1   ];

That tells you that the first observed variable loads on the first factor, the second loads on the second factor, the third loads on the sum of the factors, and the fourth loads on the difference of the factors.

Now create your observations:

>> observations = factors * loadings' + 0.1 * randn(100,4);

I added a small amount of random noise to simulate experimental error. Now we perform the PCA using the pca function from the Statistics Toolbox:

>> [coeff, score, latent, tsquared, explained, mu] = pca(observations);
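
These outputs also give exactly what the question asked for: coeff holds the principal component coefficients, which are the eigenvectors of the covariance matrix of the data (up to sign), and latent holds the corresponding eigenvalues in descending order. A quick sanity check, as a sketch assuming pca's default settings:

>> D = eig(cov(observations));               % eigenvalues of the covariance matrix
>> max(abs(sort(D, 'descend') - latent))     % should be zero to numerical precision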

The variable score is the array of principal component scores. These will be orthogonal by construction, which you can check:

>> corr(score)
ans =
    1.0000    0.0000    0.0000    0.0000
    0.0000    1.0000    0.0000    0.0000
    0.0000    0.0000    1.0000    0.0000
    0.0000    0.0000    0.0000    1.0000

The combination score * coeff' will reproduce the centered version of your observations. The mean mu is subtracted prior to performing PCA. To reproduce your original observations you need to add it back in:

>> reconstructed = score * coeff' + repmat(mu, 100, 1);
>> sum((observations - reconstructed).^2)
ans =
   1.0e-27 *
    0.0311    0.0104    0.0440    0.3378

To get an approximation to your original data, you can start dropping columns from the computed principal components. To get an idea of which columns to drop, we examine the explained variable:

>> explained
explained =
   58.0639
   41.6302
    0.1693
    0.1366

The entries tell you what percentage of the variance is explained by each of the principal components. We can clearly see that the first two components are much more significant than the last two (between them they explain more than 99% of the variance). Using the first two components to reconstruct the observations gives the rank-2 approximation:

>> approximationRank2 = score(:,1:2) * coeff(:,1:2)' + repmat(mu, 100, 1);

We can now try plotting:

>> for k = 1:4
       subplot(2, 2, k);
       hold on;
       grid on
       plot(approximationRank2(:, k), observations(:, k), 'x');
       plot([-4 4], [-4 4]);   % reference diagonal: points on it are reconstructed exactly
       xlim([-4 4]);
       ylim([-4 4]);
       title(sprintf('Variable %d', k));
   end

We get an almost perfect reproduction of the original observations. If we wanted a coarser approximation, we could just use the first principal component:

>> approximationRank1 = score(:,1) * coeff(:,1)' + repmat(mu, 100, 1);

and plot it:

>> for k = 1:4
       subplot(2, 2, k);
       hold on;
       grid on
       plot(approximationRank1(:, k), observations(:, k), 'x');
       plot([-4 4], [-4 4]);
       xlim([-4 4]);
       ylim([-4 4]);
       title(sprintf('Variable %d', k));
   end

This time the reconstruction isn't so good. That's because we deliberately constructed our data to have two factors, and we're only reconstructing it from one of them.

Note that despite the suggestive similarity between the way we constructed the original data and its reproduction,

>> observations  = factors * loadings'  +  0.1 * randn(100,4);
>> reconstructed = score   * coeff'     +  repmat(mu, 100, 1);

there is not necessarily any correspondence between factors and score, or between loadings and coeff. The PCA algorithm doesn't know anything about the way your data is constructed - it merely tries to explain as much of the total variance as it can with each successive component.
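
One way to see this for yourself, as a sketch using the variables defined above, is to correlate the hidden factors with the recovered scores. In general the result is some mixing of the two rather than an identity matrix, because the leading components span roughly the same subspace as the factors without lining up with them individually:

>> corr(factors, score(:,1:2))   % generally a full 2 x 2 mixing matrix, not the identity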

User @Mari asked in the comments how she could plot the reconstruction error as a function of the number of principal components. Using the variable explained above, this is quite easy. I'll generate some data with a more interesting factor structure to illustrate the effect:

>> factors = randn(100, 20);
>> loadings = chol(corr(factors * triu(ones(20))))';   % lower-triangular loadings: a shared first factor, later factors matter less
>> observations = factors * loadings' + 0.1 * randn(100, 20);

Now all of the observations load on a significant common factor, with other factors of decreasing importance. We can get the PCA decomposition as before:

>> [coeff, score, latent, tsquared, explained, mu] = pca(observations);

and plot the percentage of unexplained variance against the number of components as follows:

>> cumexplained = cumsum(explained);
   cumunexplained = 100 - cumexplained;
   plot(1:20, cumunexplained, 'x-');
   grid on;
   xlabel('Number of factors');
   ylabel('Unexplained variance')
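
If you would rather compute the reconstruction error directly instead of reading it off explained, a sketch along the following lines (reusing the variables above) produces the same curve, since the residual sum of squares of a rank-k reconstruction is exactly the variance carried by the dropped components:

>> sse = zeros(20, 1);
>> for k = 1:20
       approx = score(:, 1:k) * coeff(:, 1:k)' + repmat(mu, 100, 1);
       sse(k) = sum(sum((observations - approx).^2));   % residual sum of squares at rank k
   end
>> centered = observations - repmat(mu, 100, 1);
>> plot(1:20, 100 * sse / sum(sum(centered.^2)), 'x-');   % percentage unexplained; matches cumunexplained
>> grid on;
>> xlabel('Number of components');
>> ylabel('Unexplained variance (direct computation)')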
