sklearn PCA.transform gives different results for different trials
Question
I am running PCA with sklearn.decomposition.PCA. I found that if the input matrix X is big, PCA.transform gives different results for two different PCA instances fitted on the same data. For example, when X is a 100x200 matrix, there is no problem. When X is a 1000x200 or a 100x2000 matrix, the results of the two PCA instances differ. I am not sure what causes this: I assumed there were no random elements in sklearn's PCA solver. I am using sklearn version 0.18.1 with Python 2.7.
The script below illustrates the issue.
import numpy as np
from sklearn.decomposition import PCA

# 100x200 gives matching results; 1000x200 or 100x2000 shows the mismatch
n_sample, n_feature = 100, 200
X = np.random.rand(n_sample, n_feature)

# Fit two independent PCA instances on the same data
pca_1 = PCA(n_components=10)
pca_1.fit(X)
X_transformed_1 = pca_1.transform(X)
pca_2 = PCA(n_components=10)
pca_2.fit(X)
X_transformed_2 = pca_2.transform(X)

# Number of exactly equal entries, then the mean squared difference
print(np.sum(X_transformed_1 == X_transformed_2))
print(np.mean((X_transformed_1 - X_transformed_2)**2))
Recommended answer
PCA has an svd_solver parameter, and its default value is "auto". Depending on the size of the input data, it chooses the most efficient solver.
In your case, once the input size exceeds 500 (as with your 1000x200 and 100x2000 matrices), it chooses the randomized solver. The documentation describes the parameter as follows:
svd_solver : string {'auto', 'full', 'arpack', 'randomized'}
auto :
the solver is selected by a default policy based on X.shape and n_components: if the input data is larger than 500x500 and the number of components to extract is lower than 80% of the smallest dimension of the data, then the more efficient 'randomized' method is enabled. Otherwise the exact full SVD is computed and optionally truncated afterwards.
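If exact, repeatable results matter more than speed, one option is to skip the "auto" policy and request the full SVD explicitly. The snippet below is a minimal sketch of that idea, not part of the original answer: the helper name transform_with_full_svd, the seed 0, and the 1000x200 shape are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Seeded only so the demo data itself is repeatable (assumption for the demo)
X = np.random.RandomState(0).rand(1000, 200)

def transform_with_full_svd(data, n_components=10):
    # Hypothetical helper: svd_solver="full" asks for the exact SVD
    # instead of letting the "auto" policy pick a solver.
    pca = PCA(n_components=n_components, svd_solver="full")
    return pca.fit(data).transform(data)

full_1 = transform_with_full_svd(X)
full_2 = transform_with_full_svd(X)

# Largest absolute difference between the two independent fits
print(np.max(np.abs(full_1 - full_2)))
```

The trade-off is run time: the full SVD can be noticeably slower than the randomized solver on large inputs.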
To control how the randomized solver behaves, you can set the random_state parameter of PCA, which controls the random number generator.
Try using the same integer seed for both instances:
pca_1 = PCA(n_components=10, random_state=SOME_INT)
pca_2 = PCA(n_components=10, random_state=SOME_INT)
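As a quick check of the suggestion above, the following sketch fits two PCA instances with the same seed on a 1000x200 matrix and compares their outputs. The seed value 42 and the variable names are arbitrary choices for illustration, and svd_solver="randomized" is set explicitly here so the example does not depend on which solver "auto" picks in a given sklearn version.

```python
import numpy as np
from sklearn.decomposition import PCA

SEED = 42  # arbitrary integer; any fixed value works (illustrative assumption)

# Same shape as one of the problematic cases from the question
X = np.random.RandomState(0).rand(1000, 200)

# Two independent instances sharing the same random_state
pca_a = PCA(n_components=10, svd_solver="randomized", random_state=SEED)
pca_b = PCA(n_components=10, svd_solver="randomized", random_state=SEED)

Xa = pca_a.fit(X).transform(X)
Xb = pca_b.fit(X).transform(X)

# True when both instances produce the same projection
seeded_match = np.allclose(Xa, Xb)
print(seeded_match)
```

Leaving random_state unset makes each fit draw fresh random numbers, which is consistent with the small differences reported in the question.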