Matlab - PCA analysis and reconstruction of multi-dimensional data

Problem description

I have a large dataset of multidimensional data (132 dimensions).

I am a beginner at data mining and I want to apply Principal Components Analysis using Matlab. However, I have seen a lot of functions explained on the web, but I do not understand how they should be applied.

Basically, I want to apply PCA and obtain the eigenvectors and their corresponding eigenvalues from my data.

After this step I want to be able to reconstruct my data based on a selection of the obtained eigenvectors.

I can do this manually, but I was wondering if there are any predefined functions which can do this, because they should already be optimized.

My initial data is something like: size(x) = [33800 132]. So basically I have 132 features (dimensions) and 33800 data points, and I want to perform PCA on this data set.

Any help or hint would do.

Recommended answer

Here's a quick walkthrough. First we create a matrix of your hidden variables (or "factors"). It has 100 observations and there are two independent factors.

>> factors = randn(100, 2);

Now create a loadings matrix. This is going to map the hidden variables onto your observed variables. Say your observed variables have four features. Then your loadings matrix needs to be 4 x 2

>> loadings = [
      1   0
      0   1
      1   1
      1  -1   ];

That tells you that the first observed variable loads on the first factor, the second loads on the second factor, the third variable loads on the sum of factors and the fourth variable loads on the difference of the factors.

Now create your observations:

>> observations = factors * loadings' + 0.1 * randn(100,4);

I added a small amount of random noise to simulate experimental error. Now we perform the PCA using the pca function from the stats toolbox:

>> [coeff, score, latent, tsquared, explained, mu] = pca(observations);
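
If the Statistics Toolbox isn't available, the same decomposition can be obtained from svd; here's a minimal sketch (the names myScore, myCoeff and myLatent are just for illustration, and the column signs may differ from pca's output, which applies its own sign convention):

>> Xc = bsxfun(@minus, observations, mean(observations));  % center the data, as pca does
>> [U, S, V] = svd(Xc, 'econ');
>> myScore  = U * S;                           % principal component scores
>> myCoeff  = V;                               % principal component coefficients
>> myLatent = diag(S).^2 / (size(Xc, 1) - 1);  % per-component variances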

The variable score is the array of principal component scores. These will be orthogonal by construction, which you can check -

>> corr(score)
ans =
    1.0000    0.0000    0.0000    0.0000
    0.0000    1.0000    0.0000    0.0000
    0.0000    0.0000    1.0000    0.0000
    0.0000    0.0000    0.0000    1.0000

The combination score * coeff' will reproduce the centered version of your observations. The mean mu is subtracted prior to performing PCA. To reproduce your original observations you need to add it back in,

>> reconstructed = score * coeff' + repmat(mu, 100, 1);
>> sum((observations - reconstructed).^2)
ans =
   1.0e-27 *
    0.0311    0.0104    0.0440    0.3378

To get an approximation to your original data, you can start dropping columns from the computed principal components. To get an idea of which columns to drop, we examine the explained variable

>> explained
explained =
   58.0639
   41.6302
    0.1693
    0.1366

The entries tell you what percentage of the variance is explained by each of the principal components. We can clearly see that the first two components are much more significant than the last two (between them they explain more than 99% of the variance). Using the first two components to reconstruct the observations gives the rank-2 approximation,

>> approximationRank2 = score(:,1:2) * coeff(:,1:2)' + repmat(mu, 100, 1);

We can now try plotting:

>> for k = 1:4
       subplot(2, 2, k);
       hold on;
       grid on
       plot(approximationRank2(:, k), observations(:, k), 'x');
       plot([-4 4], [-4 4]);
       xlim([-4 4]);
       ylim([-4 4]);
       title(sprintf('Variable %d', k));
   end

We get an almost perfect reproduction of the original observations. If we wanted a coarser approximation, we could just use the first principal component:

>> approximationRank1 = score(:,1) * coeff(:,1)' + repmat(mu, 100, 1);

and plot it,

>> for k = 1:4
       subplot(2, 2, k);
       hold on;
       grid on
       plot(approximationRank1(:, k), observations(:, k), 'x');
       plot([-4 4], [-4 4]);
       xlim([-4 4]);
       ylim([-4 4]);
       title(sprintf('Variable %d', k));
   end

This time the reconstruction isn't so good. That's because we deliberately constructed our data to have two factors, and we're only reconstructing it from one of them.
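
To quantify this, we can repeat the earlier residual check for the rank-1 approximation; the column sums of squares will now be far from zero, since the second factor's contribution has been thrown away:

>> sum((observations - approximationRank1).^2)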

Note that despite the suggestive similarity between the way we constructed the original data and its reproduction,

>> observations  = factors * loadings'  +  0.1 * randn(100,4);
>> reconstructed = score   * coeff'     +  repmat(mu, 100, 1);

there is not necessarily any correspondence between factors and score, or between loadings and coeff. The PCA algorithm doesn't know anything about the way your data is constructed - it merely tries to explain as much of the total variance as it can with each successive component.
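
As a side note, the per-component variances are returned in latent, and explained is just latent rescaled to percentages, so either one can be used to decide how many components to keep. A quick sanity check, reusing the outputs of the pca call above:

>> max(abs(explained - 100 * latent / sum(latent)))  % should be (numerically) zero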

User @Mari asked in the comments how she could plot the reconstruction error as a function of the number of principal components. Using the variable explained above this is quite easy. I'll generate some data with a more interesting factor structure to illustrate the effect -

>> factors = randn(100, 20);
>> loadings = chol(corr(factors * triu(ones(20))))';
>> observations = factors * loadings' + 0.1 * randn(100, 20);

Now all of the observations load on a significant common factor, with other factors of decreasing importance. We can get the PCA decomposition as before

>> [coeff, score, latent, tsquared, explained, mu] = pca(observations);

and plot the percentage of explained variance as follows,

>> cumexplained = cumsum(explained);
   cumunexplained = 100 - cumexplained;
   plot(1:20, cumunexplained, 'x-');
   grid on;
   xlabel('Number of factors');
   ylabel('Unexplained variance')
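
If you want the reconstruction error itself rather than the unexplained variance, a minimal sketch (reusing the variables above; reconError and approx are illustrative names) is to rebuild the rank-k approximation for each k and record the residual sum of squares:

>> nComp = size(coeff, 2);
   reconError = zeros(nComp, 1);
   for k = 1:nComp
       % rank-k reconstruction from the first k principal components
       approx = score(:, 1:k) * coeff(:, 1:k)' + repmat(mu, size(observations, 1), 1);
       reconError(k) = sum(sum((observations - approx).^2));
   end
   plot(1:nComp, reconError, 'x-');
   grid on;
   xlabel('Number of components');
   ylabel('Reconstruction error (sum of squares)');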
