PCA in MATLAB: selecting top n components


Question


I want to select the top N=10,000 principal components from a matrix. After pca completes, MATLAB should return a p-by-p matrix, but it doesn't!

>> size(train_data)
ans =
         400      153600

>> [coefs,scores,variances] = pca(train_data);
>> size(coefs)
ans =
      153600         399

>> size(scores)
ans =
   400   399

>> size(variances)
ans =
    399     1

Shouldn't coefs be 153600 x 153600 and scores be 400 x 153600?

When I use the code below it gives me an Out of Memory error:

>> [V D] = eig(cov(train_data));
Out of memory. Type HELP MEMORY for your options.

Error in cov (line 96)
    xy = (xc' * xc) / (m-1);

I don't understand why MATLAB returns a lower-dimensional matrix. pca should hit the same out-of-memory error, since the full coefficient matrix would need 153600*153600*8 bytes ≈ 188 GB.
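
A minimal sketch of that memory estimate (illustrative only, not part of the original question; the commented-out call assumes pca's 'Economy' name-value option, which controls whether the full-size output is returned):

p = 153600;                                 % number of variables (columns of train_data)
bytesNeeded = p * p * 8;                    % a full p-by-p matrix of doubles
fprintf('%.1f GB\n', bytesNeeded / 1e9)     % prints about 188.7 GB
% [coefs, scores, variances] = pca(train_data, 'Economy', false);  % would request the full p-by-p matrix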

Error with eigs:

>> eigs(cov(train_data));
Out of memory. Type HELP MEMORY for your options.

Error in cov (line 96)
    xy = (xc' * xc) / (m-1);

Solution

Foreword

I think you are falling prey to the XY problem: trying to find 153,600 dimensions in your data is completely non-physical. Please ask about the actual problem (X) rather than your proposed solution (Y) in order to get a meaningful answer. I will use this post only to tell you why PCA is not a good fit in this case. I cannot tell you what will solve your problem, since you have not told us what that is.

This is a mathematically unsound problem, as I will try to explain here.

PCA

PCA is, as user3149915 said, a way to reduce dimensions. This means that somewhere in your problem you have one-hundred-fifty-three-thousand-six-hundred dimensions floating around. That's a lot. A heck of a lot. Explaining a physical reason for the existence of all of them might be a bigger problem than trying to solve the mathematical problem.

Trying to fit that many dimensions to only 400 observations will not work. Even if all observations are linearly independent vectors in your feature space, you can extract at most 399 dimensions, because the rest simply cannot be determined: there are no observations to pin them down. You can fit at most N-1 unique dimensions through N points; the remaining dimensions have an infinite number of possible orientations. It is like trying to fit a plane through two points: you can fit a line through them, and the third dimension will be perpendicular to that line, but its rotational direction is undefined. Hence, you are left with an infinite number of possible planes that fit through those two points.
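
A quick way to see this rank argument in MATLAB (a toy sketch with made-up sizes, not the asker's data):

X = randn(5, 100);    % 5 observations in a 100-dimensional feature space
rank(cov(X))          % 4, i.e. N-1: the sample covariance of 5 points is rank deficient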

After the first 399 components, there are no dimensions left; you are fitting a void after that. You used all your data to determine those dimensions and cannot create more. Impossible. All you can do is get more observations, some 1.5M of them, and do the PCA again.
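
The same limit shows up directly in pca's output sizes; here is a minimal sketch with assumed toy dimensions:

X = randn(40, 1000);              % 40 observations, 1000 variables
[coeff, score, latent] = pca(X);  % default economy-size output
size(coeff)                       % 1000 x 39: at most N-1 = 39 components
size(latent)                      % 39 x 1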

More observations than dimensions

Why do you need more observations than dimensions, you might ask? Easy: you cannot fit a unique line through a point, nor a unique plane through two points, nor a unique 153,600-dimensional hyperplane through 400 points.

So, if I get 153,600 observations, I'm set?

Sadly, no. If you have two points and fit a line through them, you get a 100% fit. No error, yay! Done for the day, let's go home and watch TV! Sadly, your boss will call you in the next morning because your fit is rubbish. Why? Well, if you had, for instance, 20 points scattered around, the fit would not be error-free, but it would be much closer to representing your actual data, since the first two could be outliers (the original answer illustrates this with a figure in which the red points are the first two observations).
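
A small sketch of that two-points-versus-twenty comparison (entirely made-up numbers, only to illustrate the point):

x2 = [0 1];  y2 = [3 7];                  % two observations: a line fits them exactly
p2 = polyfit(x2, y2, 1);                  % zero residual, yet it says nothing about new data
x20 = linspace(0, 1, 20);
y20 = 2*x20 + 1 + 0.5*randn(1, 20);       % twenty noisy observations of the real trend
p20 = polyfit(x20, y20, 1);               % imperfect fit, but far closer to the underlying trend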

If you were to extract the first 10,000 components, that would be 399 exact fits and 9,601 zero dimensions. You might as well not even attempt to calculate beyond the 399th dimension and simply pad the result with zeros up to 10,000 entries.
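
If you really did need a 10,000-column array, padding with zeros is all there is (a hypothetical sketch; note the padded matrix alone takes roughly 153600*10000*8 bytes ≈ 12 GB):

[coefs, scores, variances] = pca(train_data);                            % coefs is 153600 x 399
nWanted = 10000;
coefsPadded = [coefs, zeros(size(coefs, 1), nWanted - size(coefs, 2))];  % columns 400..10000 are identically zero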

TL;DR You cannot use PCA and we cannot help you solve your problem as long as you do not tell us what your problem is.
