Principal component analysis (PCA) in R: which function to use?


Question

Can anyone explain what the major differences between the prcomp and princomp functions are?

Is there any particular reason why I should choose one over the other? In case this is relevant, the type of application I am looking at is a quality control analysis for genomic (expression) data sets.

Thanks!

Recommended answer

These two functions differ with respect to:

  • the function parameters (what you can/must pass in when you call the function);
  • the values returned by each; and
  • the numerical technique used by each to calculate principal components.


In particular, princomp can be faster (and the performance difference grows with the size of the data matrix), because it calculates principal components via eigendecomposition of the covariance matrix, whereas prcomp calculates them via singular value decomposition (SVD) of the original data matrix.
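To make the two numerical routes concrete, here is a minimal sketch on simulated data (the matrix `X` and the variable names are illustrative assumptions, not taken from the original answer). It reproduces the component standard deviations by hand, once from the covariance matrix and once from the SVD. Note that princomp divides by n rather than n - 1, so its `sdev` differs from prcomp's by a constant factor.

```r
# Simulated data (an assumption for illustration): 1000 rows, 10 columns
set.seed(1)
X <- matrix(rnorm(1000 * 10), nrow = 1000, ncol = 10)
n <- nrow(X)

# Route 1: eigendecomposition of the covariance matrix (the princomp approach)
eig <- eigen(cov(X), symmetric = TRUE)
sdev_eigen <- sqrt(eig$values)

# Route 2: SVD of the column-centered data matrix (the prcomp approach)
Xc <- scale(X, center = TRUE, scale = FALSE)
sdev_svd <- svd(Xc)$d / sqrt(n - 1)

pc1 <- prcomp(X)
pc2 <- princomp(X)

# Both hand-computed routes agree with prcomp
all.equal(sdev_eigen, pc1$sdev)
all.equal(sdev_svd, pc1$sdev)

# princomp uses divisor n instead of n - 1, hence the rescaling
all.equal(unname(pc2$sdev), pc1$sdev * sqrt((n - 1) / n))
```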

Eigenvalue decomposition is only defined for square matrices (the technique amounts to solving the characteristic polynomial), but that is not a practical limitation, because it always involves the preliminary step of calculating the covariance matrix from the original data matrix.

Not only is the covariance matrix square, it is usually much smaller than the original data matrix (as long as the number of attributes m is less than the number of rows n, i.e., m < n, which is true most of the time).

The former (eigenvector decomposition) is less accurate (the difference is often not material) but faster, because the computation is performed on the covariance matrix rather than on the original data matrix. For instance, if the data matrix has the usual shape with n >> m, e.g., n = 1000 rows and m = 10 columns, then the covariance matrix is 10 x 10; by contrast, prcomp calculates the SVD on the original 1000 x 10 matrix.
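The size argument can be checked directly. The sketch below uses simulated data of the shape mentioned above (an assumption for illustration) and times the two decompositions with `system.time`. Timings depend on the machine, the BLAS/LAPACK build, and the data, so treat the numbers as something to measure rather than a guaranteed ranking.

```r
# Simulated 1000 x 10 data matrix (illustrative only)
set.seed(1)
n <- 1000
m <- 10
X <- matrix(rnorm(n * m), nrow = n, ncol = m)

dim(X)       # the matrix the SVD works on
dim(cov(X))  # the matrix the eigendecomposition works on

# Time each route; results vary from run to run
t_eigen <- system.time(eigen(cov(X), symmetric = TRUE))
t_svd   <- system.time(svd(scale(X, center = TRUE, scale = FALSE)))
print(t_eigen)
print(t_svd)
```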

I don't know the shape of data matrices for genomic expression data, but if the rows are in the thousands or even hundreds, then prcomp may be noticeably slower than princomp. I don't know your context, e.g., whether PCA is performed as a single step in a larger data flow and whether net performance (execution speed) is a concern, so I can't say whether this performance difference is relevant for your use case. Likewise, it's difficult to say whether the difference in numerical accuracy between the two techniques is significant; it depends on the data.
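One shape-related point is worth checking for expression data. If samples are rows and genes are columns (an assumption about your layout), the matrix typically has far more columns than rows. The sketch below uses a simulated 20 x 100 matrix with hypothetical names to show how each function reacts: prcomp runs, while princomp stops with an error because it requires more observations than variables.

```r
# Simulated "wide" matrix: 20 samples (rows), 100 variables (columns)
set.seed(42)
X_wide <- matrix(rnorm(20 * 100), nrow = 20, ncol = 100)

# prcomp handles the wide case; it returns at most min(n, m) components
pc_wide <- prcomp(X_wide)
dim(pc_wide$x)

# princomp refuses data with fewer rows than columns; capture the error
res <- tryCatch(princomp(X_wide), error = function(e) e)
if (inherits(res, "error")) message(conditionMessage(res))
```

If your data has this shape, prcomp is the practical choice regardless of speed.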

princomp returns a list comprised of seven items; prcomp returns a list of five.

> names(pc1)    # prcomp
    [1] "sdev"     "rotation" "center"   "scale"    "x"       

> names(pc2)    # princomp
    [1] "sdev"     "loadings" "center"   "scale"    "n.obs"    "scores"   "call"    
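The answer does not say how `pc1` and `pc2` were created. A minimal sketch that yields objects with these names, assuming the built-in `USArrests` data set as a stand-in for the real data:

```r
# USArrests ships with R; it stands in here for the data in the answer
pc1 <- prcomp(USArrests)
pc2 <- princomp(USArrests)

names(pc1)
names(pc2)
```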

For princomp, the most important items returned are the component scores and the loadings.

The values returned by the two functions can be reconciled (compared) this way: prcomp returns, among other things, a matrix called rotation which is equivalent to the loadings matrix returned by princomp.
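A quick check of that equivalence, again assuming the built-in `USArrests` data as a stand-in. The columns of the two matrices can differ in sign, since the direction of a principal axis is arbitrary, so the comparison is made on absolute values.

```r
pc1 <- prcomp(USArrests)
pc2 <- princomp(USArrests)

# Strip the "loadings" class so both objects are plain numeric matrices
rot <- unclass(pc1$rotation)
lod <- unclass(pc2$loadings)

# Largest elementwise discrepancy, ignoring sign
max_diff <- max(abs(abs(rot) - abs(lod)))
max_diff
```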

If you multiply the (centered) original data matrix by prcomp's rotation matrix, you get the matrix that prcomp stores in x (the component scores).
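As a sketch of that relationship (with `USArrests` as assumed example data), the scores can be rebuilt by centering the data with the stored `center` vector and multiplying by `rotation`:

```r
pc1 <- prcomp(USArrests)

# Center the data with the column means prcomp stored, then rotate
Xc <- scale(as.matrix(USArrests), center = pc1$center, scale = FALSE)
scores <- Xc %*% pc1$rotation

# Matches the x component returned by prcomp
all.equal(scores, pc1$x, check.attributes = FALSE)
```

If prcomp was called with scaling turned on, the data would also need to be divided by the stored `scale` vector before the multiplication.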

Finally, prcomp has a plot method that draws a scree plot, which shows the relative importance (variance) of each principal component; in my opinion this is the most useful visualization of PCA.
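A short sketch of the scree plot and the matching numeric summary, assuming `USArrests` as example data. The plot shows per-component variances; the cumulative proportion of variance is available from `summary`.

```r
pc1 <- prcomp(USArrests, scale. = TRUE)

# Scree plot of the component variances; plot(pc1) draws the bar version
screeplot(pc1, type = "lines")

# Proportion of variance per component, computed by hand
prop_var <- pc1$sdev^2 / sum(pc1$sdev^2)
prop_var

# summary() reports the same proportions plus the cumulative proportion
imp <- summary(pc1)$importance
imp["Cumulative Proportion", ]
```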

prcomp will mean-center your data and scale it to unit variance for you if you set the arguments center and scale. to TRUE (center is TRUE by default; scale. is FALSE by default). That's a trivial difference between the two, given that you can both scale and mean-center your data in a single line using the scale function.
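The sketch below, using `USArrests` as assumed example data, checks that pre-processing with `scale` gives the same result as prcomp's own arguments. For princomp, the closest built-in option is `cor = TRUE`, which works from the correlation matrix instead.

```r
# Let prcomp center and scale
pc_a <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Center and scale up front with scale(), then switch both options off
pc_b <- prcomp(scale(USArrests), center = FALSE, scale. = FALSE)

all.equal(pc_a$sdev, pc_b$sdev)
all.equal(abs(pc_a$x), abs(pc_b$x), check.attributes = FALSE)

# princomp on the correlation matrix gives the same standard deviations
pc_c <- princomp(USArrests, cor = TRUE)
all.equal(unname(pc_c$sdev), pc_a$sdev)
```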
