sklearn PCA.transform gives different results for different trials


Question

I am running PCA with sklearn.decomposition.PCA. I found that when the input matrix X is large, two different PCA instances give different results from PCA.transform. For example, when X is a 100x200 matrix there is no problem, but when X is a 1000x200 or a 100x2000 matrix the results of the two instances differ. I am not sure what causes this: I assumed there are no random elements in sklearn's PCA solver. I am using sklearn version 0.18.1 with Python 2.7.

The script below illustrates the issue.

import numpy as np
from sklearn.decomposition import PCA

# 100x200 gives identical results; 1000x200 or 100x2000 does not
n_sample, n_feature = 100, 200
X = np.random.rand(n_sample, n_feature)

# First PCA instance
pca_1 = PCA(n_components=10)
pca_1.fit(X)
X_transformed_1 = pca_1.transform(X)

# Second PCA instance, fitted on the same data
pca_2 = PCA(n_components=10)
pca_2.fit(X)
X_transformed_2 = pca_2.transform(X)

# Number of exactly equal entries, then the mean squared difference
print(np.sum(X_transformed_1 == X_transformed_2))
print(np.mean((X_transformed_1 - X_transformed_2) ** 2))

Recommended answer

PCA has an svd_solver parameter, and its default value is "auto". Depending on the size of the input data, it chooses the most efficient solver.

In your case, once the input size exceeds 500, it chooses the randomized solver, which is why two separately fitted instances can disagree.

svd_solver : string {'auto', 'full', 'arpack', 'randomized'}

auto :

the solver is selected by a default policy based on X.shape and n_components: if the input data is larger than 500x500 and the number of components to extract is lower than 80% of the smallest dimension of the data, then the more efficient 'randomized' method is enabled. Otherwise the exact full SVD is computed and optionally truncated afterwards.
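One way to sidestep the automatic choice described above is to request the exact solver explicitly. The snippet below is a minimal sketch, not part of the original answer: it assumes a 1000x200 random matrix like the one in the question, uses a fixed seed only so the demo data is repeatable, and passes svd_solver="full" to both instances.

```python
import numpy as np
from sklearn.decomposition import PCA

# Demo data (assumption: seeded 1000x200 matrix, mirroring the question's sizes)
rng = np.random.RandomState(0)
X = rng.rand(1000, 200)

# Force the exact full SVD instead of relying on svd_solver="auto"
pca_a = PCA(n_components=10, svd_solver="full")
pca_b = PCA(n_components=10, svd_solver="full")

X_a = pca_a.fit(X).transform(X)
X_b = pca_b.fit(X).transform(X)

# The exact solver has no random component, so both projections should agree
same_full = np.allclose(X_a, X_b)
print(same_full)
print(np.mean((X_a - X_b) ** 2))
```

The trade-off is speed: the full SVD can be noticeably slower than the randomized method on large inputs, which is the reason "auto" switches solvers in the first place.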

To control how the randomized solver behaves, you can set the random_state parameter of PCA, which controls the random number generator.

Try using the same integer seed in both instances (SOME_INT is a placeholder):

pca_1 = PCA(n_components=10, random_state=SOME_INT)
pca_2 = PCA(n_components=10, random_state=SOME_INT)
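The suggestion above can be tried end to end with a sketch like the following. It rests on a few assumptions that are not in the original answer: SEED = 42 is an arbitrary stand-in for SOME_INT, the data is a seeded 1000x200 random matrix, and svd_solver="randomized" is set explicitly so the example does not depend on what "auto" picks in a given sklearn version.

```python
import numpy as np
from sklearn.decomposition import PCA

# Demo data (assumption: seeded 1000x200 matrix, mirroring the question's sizes)
rng = np.random.RandomState(0)
X = rng.rand(1000, 200)

SEED = 42  # hypothetical value standing in for SOME_INT

# Same seed in both instances, randomized solver requested explicitly
pca_1 = PCA(n_components=10, svd_solver="randomized", random_state=SEED)
pca_2 = PCA(n_components=10, svd_solver="randomized", random_state=SEED)

X_transformed_1 = pca_1.fit(X).transform(X)
X_transformed_2 = pca_2.fit(X).transform(X)

# With identical seeds, both fits draw the same random numbers
same_seeded = np.allclose(X_transformed_1, X_transformed_2)
print(same_seeded)
```

Comparing with np.allclose rather than == is a deliberate choice here: it tolerates floating-point noise, whereas an exact equality count can report mismatches that are numerically negligible.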
