PCA in MATLAB: selecting the top n components

Question

I want to select the top N=10,000 principal components from a matrix. After pca completes, MATLAB should return a p-by-p matrix, but it doesn't!

>> size(train_data)
ans =
         400      153600

>> [coefs,scores,variances] = pca(train_data);
>> size(coefs)
ans =
      153600         399

>> size(scores)
ans =

   400   399
>> size(variances)
ans =
    399     1

Shouldn't it be coefs: 153600 x 153600, and scores: 400 x 153600?

When I use the code below, it gives me an Out of Memory error:

>> [V D] = eig(cov(train_data));
Out of memory. Type HELP MEMORY for your options.

Error in cov (line 96)
    xy = (xc' * xc) / (m-1);

I don't understand why MATLAB returns a lower-dimensional matrix. I would have expected pca itself to fail with an error, since a full 153600 x 153600 matrix needs 153600*153600*8 bytes = 188 GB.
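The arithmetic in the question can be checked directly. The sketch below is only a back-of-the-envelope calculation; it assumes double precision (8 bytes per element) and assumes that pca returned its economy-size output, which is consistent with the 153600 x 399 size shown above. The variable names are illustrative.

```matlab
% Sizes taken from the question.
n = 400;                 % observations (rows of train_data)
p = 153600;              % variables (columns of train_data)
bytes_per_double = 8;    % assumption: data stored as double

% A full p-by-p coefficient (or covariance) matrix:
full_bytes = p * p * bytes_per_double;

% The p-by-(n-1) coefficient matrix that pca actually returned:
econ_bytes = p * (n - 1) * bytes_per_double;

fprintf('full: %.1f GB, economy: %.2f GB\n', full_bytes / 1e9, econ_bytes / 1e9);
% prints "full: 188.7 GB, economy: 0.49 GB"
```

This is why pca succeeds while cov(train_data) runs out of memory: only the second call has to build the full p-by-p matrix.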

Error with eigs:

>> eigs(cov(train_data));
Out of memory. Type HELP MEMORY for your options.

Error in cov (line 96)
    xy = (xc' * xc) / (m-1);
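Both errors come from cov building the p-by-p covariance matrix before eig or eigs ever runs. As a minimal sketch (not necessarily what pca does internally), the nonzero eigenvalues of the covariance matrix can be obtained from an economy-size SVD of the centered data, which never forms that matrix. The small sizes below are hypothetical stand-ins for 400 and 153600, chosen so that the explicit covariance can still be computed for comparison.

```matlab
rng(0);                          % reproducible random data
n = 6;  p = 40;                  % hypothetical small stand-ins for 400 and 153600
X = randn(n, p);

Xc = X - mean(X, 1);             % center the columns (implicit expansion, R2016b or later)
[~, S, V] = svd(Xc, 'econ');     % V is p-by-n; no p-by-p matrix is created
variances = diag(S).^2 / (n - 1);

% Reference computed the expensive way (only feasible because p is small here).
C = cov(X);
C = (C + C') / 2;                % enforce exact symmetry before eig
ev = sort(eig(C), 'descend');

% Largest difference over the n-1 eigenvalues that can be nonzero.
max_diff = max(abs(ev(1:n-1) - variances(1:n-1)));
```

The columns of V play the role of coefs, and max_diff should be at the level of rounding error.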

Recommended answer

Foreword

I think you are falling prey to the XY problem: trying to find 153,600 dimensions in your data is completely non-physical. Please ask about the problem (X) rather than your proposed solution (Y) in order to get a meaningful answer. I will use this post only to explain why PCA is not a good fit in this case. I cannot tell you what will solve your problem, since you have not told us what it is.

This is a mathematically unsound problem, as I will try to explain here.

PCA is, as user3149915 said, a way to reduce dimensions. This means that somewhere in your problem you have one-hundred-fifty-three-thousand-six-hundred dimensions floating around. That's a lot. A heck of a lot. Explaining a physical reason for the existence of all of them might be a bigger problem than trying to solve the mathematical problem.

Trying to fit that many dimensions to only 400 observations will not work. Even if all observations are linearly independent vectors in your feature space, you can extract at most 399 dimensions; the rest simply cannot be found, because there are no observations to define them. You can fit at most N-1 unique dimensions through N points, and the remaining dimensions have an infinite number of possible orientations. It is like trying to fit a plane through two points: there is a line you can fit through them, and the third dimension will be perpendicular to that line but undefined in the rotational direction. Hence, you are left with an infinite number of possible planes that pass through those two points.
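The N-1 limit is easy to see numerically. The following is a minimal sketch on random data, with hypothetical small sizes standing in for 400 observations and 153,600 variables: once the column means are subtracted, the rows sum to zero, so the centered matrix cannot have rank higher than N-1.

```matlab
rng(1);                    % reproducible random data
n = 5;  p = 50;            % hypothetical small stand-ins for 400 and 153600
X = randn(n, p);

Xc = X - mean(X, 1);       % centering, as PCA does (implicit expansion, R2016b or later)

row_sum = sum(Xc, 1);      % every column of Xc sums to (numerically) zero
r = rank(Xc);              % number of directions the data can define
s = svd(Xc);               % n singular values; the last one is numerically zero

fprintf('rank = %d, smallest singular value = %.2e\n', r, s(end));
```

With generic data the rank comes out as n-1, which matches the 399 columns returned for 400 observations in the question.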

I do not think you are trying to fit "noise" after the first 400 components, I think you are fitting a void after that. You used all your data to get the dimensions and cannot create more dimensions. Impossible. All you can do is get more observations, some 1.5M, and do the PCA again.

"Why do you need more observations than dimensions?" you might ask. Easy: you cannot fit a unique line through a single point, nor a unique plane through two points, nor a unique 153,600-dimensional hyperplane through 400 points.

Sadly, no. If you have two points and fit a line through them, you get a 100% fit. No error, yay! Done for the day, let's go home and watch TV! Sadly, your boss will call you in the next morning because your fit is rubbish. Why? Well, if you had, for instance, 20 points scattered around, the fit would not be free of error, but it would at least come closer to representing your actual data, since the first two points could be outliers. See this very illustrative figure, where the red points would be your first two observations:
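The same point can be reproduced with a few lines of code. This is only an illustrative sketch on made-up data: the underlying line (slope 2, intercept 1) and the noise level are assumptions chosen for the example, not values from the question.

```matlab
rng(2);                                   % reproducible noise
x = linspace(0, 10, 20)';
y = 2 * x + 1 + randn(20, 1);             % hypothetical noisy line

% Fit through the first two observations only: a "perfect" fit.
c2   = polyfit(x(1:2), y(1:2), 1);
res2 = norm(polyval(c2, x(1:2)) - y(1:2));    % essentially zero

% Least-squares fit through all 20 observations: not error-free.
c20   = polyfit(x, y, 1);
res20 = norm(polyval(c20, x) - y);

% How the two-point line does on the full data set.
err2_all = norm(polyval(c2, x) - y);

fprintf('two-point fit on its own points: %.2e\n', res2);
fprintf('20-point fit on all points:      %.2f\n', res20);
fprintf('two-point fit on all points:     %.2f\n', err2_all);
```

The two-point line has zero error on its own two points, yet it can never beat the least-squares line on the full data set, and it is usually considerably worse.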

If you were to extract the first 10,000 components, that would be 399 exact fits and 9601 zero dimensions. You might as well not attempt to calculate anything beyond the 399th dimension, and simply store the result in a zero array with 10,000 entries.
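If an array with N entries is really required downstream, the zero padding described above could look like the sketch below. The sizes are hypothetical small stand-ins for 400, 153,600 and 10,000, and an economy-size SVD is used here in place of pca so that the example needs no toolbox. Keep in mind that a 153600 x 10000 array of doubles would itself take about 12 GB, almost all of it zeros.

```matlab
rng(3);                            % reproducible random data
n = 5;  p = 30;  N = 12;           % hypothetical stand-ins for 400, 153600 and 10000
X = randn(n, p);

Xc = X - mean(X, 1);               % center the columns (R2016b or later)
[~, S, V] = svd(Xc, 'econ');
k = n - 1;                         % number of components the data can define
latent = diag(S).^2 / (n - 1);     % component variances

% Zero-padded outputs: only the first k entries carry information.
variances = zeros(N, 1);
variances(1:k) = latent(1:k);

coefs = zeros(p, N);
coefs(:, 1:k) = V(:, 1:k);
```

Everything past column k is exactly zero by construction, so nothing is gained compared with keeping only the first k components.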

TL;DR: You cannot use PCA here, and we cannot help you solve your problem as long as you do not tell us what it is.
