Resampling not producing expected result of principal component analysis

Question

I am trying the following code to produce confidence intervals for a principal component analysis using resampling with replacement (as in the bootstrap). I am using the first 4 columns of the iris dataset.

The prcomp function produces the following output:

> mydf = iris[1:4]
> print(prcomp(mydf))
Standard deviations:
[1] 2.0562689 0.4926162 0.2796596 0.1543862

Rotation:
                     PC1         PC2         PC3        PC4
Sepal.Length  0.36138659 -0.65658877  0.58202985  0.3154872
Sepal.Width  -0.08452251 -0.73016143 -0.59791083 -0.3197231
Petal.Length  0.85667061  0.17337266 -0.07623608 -0.4798390
Petal.Width   0.35828920  0.07548102 -0.54583143  0.7536574

Resampling with replacement and averaging the loadings:

> times = 1000
> ll = list()
> for(i in 1:times) {
+ tempdf =  mydf[sample(nrow(mydf), replace = TRUE), ]
+ ll[[length(ll)+1]] = prcomp(tempdf)$rotation
+ }
> 
> dd = data.frame(apply(simplify2array(ll), 1:2, mean))
> print(dd)
                      PC1          PC2          PC3          PC4
Sepal.Length  0.005574165 -0.039480258  0.044537991  0.007778055
Sepal.Width  -0.002587333 -0.040273812 -0.050793200 -0.005473271
Petal.Length  0.015681233  0.010952361 -0.005769051 -0.011351172
Petal.Width   0.006513656  0.008296928 -0.041805210  0.019109323

Lower confidence limits (2.5% quantile):

> ddlower = data.frame(apply(simplify2array(ll), 1:2, quantile, probs=0.025))
> print(ddlower)
                    PC1        PC2        PC3        PC4
Sepal.Length -0.3859257 -0.7274809 -0.6560139 -0.3807826
Sepal.Width  -0.1127749 -0.7907801 -0.6818251 -0.3941001
Petal.Length -0.8633386 -0.2058064 -0.1333520 -0.4919584
Petal.Width  -0.3702979 -0.1328146 -0.6203322 -0.8088710

Upper confidence limits (97.5% quantile):

> ddupper = data.frame(apply(simplify2array(ll), 1:2, quantile, probs=0.975))
> print(ddupper)
                   PC1       PC2       PC3       PC4
Sepal.Length 0.3860431 0.7250412 0.6632126 0.3831889
Sepal.Width  0.1111863 0.7993649 0.6758156 0.3987939
Petal.Length 0.8638549 0.2106540 0.1318556 0.4915670
Petal.Width  0.3721362 0.1510708 0.6246988 0.8083421

I find that the averaged loadings are very different from the original ones. Moreover, the confidence intervals are centred on 0 for all variables and components. I also checked other (large) datasets and the findings are very similar. Judging from these confidence intervals, none of the loadings is significantly different from 0. There is obviously some error in the code, but I cannot find it. Thanks for your help.

Recommended answer

Given that the sign of an eigenvector is not defined (you can flip the configuration and get the same result), it doesn't make sense to form a confidence interval on the signed value of the loading.

Instead, compute the confidence interval on the absolute value of the loading, not the signed value.

Think about what happens to your interval when the loading for, say, Sepal.Length flips from ~ -0.3 to ~ +0.3. The loading is similar in both cases when considered in terms of absolute size. When you look at the actual signed value, however, it is logical for the loading to average about 0, because you are averaging a lot of values near -0.3 and near +0.3.
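The sign ambiguity can be checked directly. The following is a minimal sketch, assuming the default prcomp settings (centred, unscaled data), so that the squared standard deviations are the eigenvalues of cov(X). The names X, S, v1, lambda, ok_pos, ok_neg and var_scores are illustrative only.

```r
# Sketch: an eigenvector and its negation are equally valid solutions.
X      <- as.matrix(iris[1:4])
pc     <- prcomp(X)
S      <- cov(X)                 # covariance matrix that prcomp decomposes
v1     <- pc$rotation[, 1]       # loadings of PC1
lambda <- pc$sdev[1]^2           # variance (eigenvalue) of PC1

# S v = lambda v holds for v1 and for -v1 alike
ok_pos <- isTRUE(all.equal(as.vector(S %*% v1),    lambda * as.vector(v1)))
ok_neg <- isTRUE(all.equal(as.vector(S %*% (-v1)), lambda * as.vector(-v1)))

# The PC1 scores merely change sign, so their variance is identical
Xc         <- scale(X, center = TRUE, scale = FALSE)
var_scores <- c(var(as.vector(Xc %*% v1)), var(as.vector(Xc %*% (-v1))))

print(c(ok_pos, ok_neg))
print(var_scores)
```

Because both directions describe the same component, which one prcomp returns for a given resample is essentially arbitrary.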

To visualise why your original attempt failed, run:

set.seed(1)
mydf <- iris[1:4]
times <- 1000
ll <- vector(mode = "list", length = times)
for (i in seq_len(times)) {
  tempdf  <- mydf[sample(nrow(mydf), replace = TRUE), ]
  ll[[i]] <- prcomp(tempdf)$rotation
}

This is effectively your code, modified to suit my sensibilities. Now extract the loading for Sepal.Length on PC1 and draw a histogram of the values:

hist(sapply(ll, `[`, 1, 1))

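This produces a histogram of the 1000 bootstrapped loadings. Given the quantiles reported in the question, the values should fall into two clusters, one near -0.36 and one near +0.36.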
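The same check can be made numerically rather than graphically. This is a small sketch that repeats the loop above and tabulates the sign of the Sepal.Length loading on PC1; the names pc1_sl and signs are illustrative only. If the signs flip between resamples as they did in the question, the table should contain both -1 and +1, while the magnitudes stay in a narrow band.

```r
set.seed(1)
mydf  <- iris[1:4]
times <- 1000
ll <- vector(mode = "list", length = times)
for (i in seq_len(times)) {
  tempdf  <- mydf[sample(nrow(mydf), replace = TRUE), ]
  ll[[i]] <- prcomp(tempdf)$rotation
}

pc1_sl <- sapply(ll, `[`, 1, 1)   # Sepal.Length loading on PC1, one per resample
signs  <- sign(pc1_sl)

print(table(signs))               # how often each sign occurs
print(range(abs(pc1_sl)))         # spread of the magnitudes, ignoring sign
```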

So compute the confidence interval on the absolute value of the loading rather than the signed value, e.g.:

set.seed(1)
mydf <- iris[1:4]
times <- 1000
ll <- vector(mode = "list", length = times)
for (i in seq_len(times)) {
  tempdf  <- mydf[sample(nrow(mydf), replace = TRUE), ]
  ll[[i]] <- abs(prcomp(tempdf)$rotation) ## NOTE: abs(...)
}

This gives the 2.5% and 97.5% quantiles of the absolute loadings:

> data.frame(apply(simplify2array(ll), 1:2, quantile, probs = 0.025))
                    PC1         PC2        PC3       PC4
Sepal.Length 0.33066830 0.578558222 0.45955051 0.2252653
Sepal.Width  0.05211013 0.623424084 0.49591685 0.2351746
Petal.Length 0.84823899 0.133137927 0.01226608 0.4607265
Petal.Width  0.34284824 0.007403214 0.44932031 0.6780493

> data.frame(apply(simplify2array(ll), 1:2, quantile, probs = 0.975))
                   PC1       PC2       PC3       PC4
Sepal.Length 0.3891499 0.7443276 0.6690553 0.3898237
Sepal.Width  0.1186205 0.7988607 0.7010495 0.4083784
Petal.Length 0.8653324 0.2153410 0.1450756 0.4933340
Petal.Width  0.3742441 0.1645692 0.6350899 0.8154254
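If the sign of a loading matters for interpretation, a different technique from the one in this answer is to align each resampled solution with a reference solution before taking quantiles. The sketch below is only one way this might be done: it uses the full-data loadings as the reference and flips any resampled component whose dot product with the matching reference component is negative. The names ref, boot, flip, arr, lower and upper are illustrative, and the sketch assumes the components keep the same order in every resample, which may not hold when two standard deviations are close.

```r
set.seed(1)
mydf  <- iris[1:4]
times <- 1000
ref   <- prcomp(mydf)$rotation        # reference loadings from the full data

boot <- vector(mode = "list", length = times)
for (i in seq_len(times)) {
  tempdf <- mydf[sample(nrow(mydf), replace = TRUE), ]
  rot    <- prcomp(tempdf)$rotation
  # One sign per component: +1 if it points the same way as the reference
  flip   <- sign(colSums(rot * ref))
  flip[flip == 0] <- 1                # leave exactly orthogonal columns as they are
  boot[[i]] <- sweep(rot, 2, flip, `*`)   # multiply column j by flip[j]
}

arr   <- simplify2array(boot)
lower <- apply(arr, 1:2, quantile, probs = 0.025)
upper <- apply(arr, 1:2, quantile, probs = 0.975)
print(lower)
print(upper)
```

Unlike intervals on absolute values, intervals built this way can span 0, so they can still indicate that a loading is not clearly different from 0.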
