Selecting kernel and hyperparameters for kernel PCA reduction
Question
I'm reading Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems.
I'm trying to optimize an unsupervised kernel PCA algorithm. Here is some context:
Another approach, this time entirely unsupervised, is to select the kernel and hyperparameters that yield the lowest reconstruction error. However, reconstruction is not as easy as with linear PCA
....
Fortunately, it is possible to find a point in the original space that would map close to the reconstructed point. This is called the reconstruction pre-image. Once you have this pre-image, you can measure its squared distance to the original instance. You can then select the kernel and hyperparameters that minimize this reconstruction pre-image error.
One solution is to train a supervised regression model, with the projected instances as the training set and the original instances as the targets.
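The regression idea in the passage above can be sketched by hand. This is only an illustration under assumptions that are not in the book excerpt: the swiss-roll data, the choice of KernelRidge as the regressor, and its gamma and alpha values are all picked for the example. Setting fit_inverse_transform=True on KernelPCA lets scikit-learn learn an inverse mapping of this general kind for you, so this is not meant as a replacement for it.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import mean_squared_error

# Assumed toy data: a 3D swiss roll (not part of the original question).
X, _ = make_swiss_roll(n_samples=300, noise=0.1, random_state=0)

# Project to 2D with kernel PCA, without the built-in inverse transform.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.0433)
X_reduced = kpca.fit_transform(X)

# Supervised regression: projected instances are the inputs,
# original instances are the (multi-output) targets.
regressor = KernelRidge(kernel="rbf", gamma=0.0433, alpha=1.0)
regressor.fit(X_reduced, X)

# The predictions are the pre-images; compare them with the originals.
X_preimage = regressor.predict(X_reduced)
preimage_error = mean_squared_error(X, X_preimage)
print(preimage_error)
```

Note that this error is measured on the same instances the regressor was trained on, which is why the question goes on to ask about cross-validation.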
Now you can use grid search with cross-validation to find the kernel and hyperparameters that minimize this pre-image reconstruction error.
The code provided in the book to perform the reconstruction without cross-validation is:
from sklearn.decomposition import KernelPCA
rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.0433, fit_inverse_transform=True)
X_reduced = rbf_pca.fit_transform(X)
X_preimage = rbf_pca.inverse_transform(X_reduced)
>>> from sklearn.metrics import mean_squared_error
>>> mean_squared_error(X, X_preimage)
32.786308795766132
My question is: how do I go about implementing cross-validation to tune the kernel and hyperparameters so as to minimize the pre-image reconstruction error?
Here is my approach so far:
from sklearn.metrics import mean_squared_error
from sklearn.decomposition import KernelPCA
mean_squared_error(X, X_preimage)
kpca=KernelPCA(fit_inverse_transform=True, n_jobs=-1)
from sklearn.model_selection import GridSearchCV
param_grid = [{
"kpca__gamma": np.linspace(0.03, 0.05, 10),
"kpca__kernel": ["rbf", "sigmoid", "linear", "poly"]
}]
grid_search = GridSearchCV(clf, param_grid, cv=3, scoring='mean_squared_error')
X_reduced = kpca.fit_transform(X)
X_preimage = kpca.inverse_transform(X_reduced)
grid_search.fit(X,X_preimage)
Thanks
Recommended answer
GridSearchCV is capable of doing cross-validation of unsupervised learning (without a y), as can be seen in the documentation of its fit method:
fit(X, y=None, groups=None, **fit_params)
...
y : array-like, shape = [n_samples] or [n_samples, n_output], optional
Target relative to X for classification or regression;
None for unsupervised learning
...
So the only thing that needs to be handled is how the scoring will be done.
The following will happen in GridSearchCV:
1. The data X will be divided into train-test splits based on the folds defined by the cv parameter.
2. For each combination of parameters that you specified in param_grid, the model will be trained on the train part from the step above, and then the scoring will be applied to the test part.
3. The scores for each parameter combination will be averaged across all folds. The highest-performing parameter combination will be selected.
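The steps listed above can be sketched as an explicit loop. This is a rough illustration of the idea, not GridSearchCV's actual implementation. The swiss-roll data, the small grid, and n_components=2 are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, ParameterGrid

# Assumed toy data and a deliberately small grid.
X, _ = make_swiss_roll(n_samples=300, noise=0.1, random_state=0)
param_grid = {"gamma": [0.03, 0.05], "kernel": ["rbf", "linear"]}

# Step 1: split the data into folds.
cv = KFold(n_splits=3)

mean_scores = []
for params in ParameterGrid(param_grid):
    fold_scores = []
    for train_idx, test_idx in cv.split(X):
        # Step 2: train on the train part...
        model = KernelPCA(n_components=2, fit_inverse_transform=True, **params)
        model.fit(X[train_idx])
        # ...and score on the test part (negated so that higher is better).
        X_test = X[test_idx]
        X_preimage = model.inverse_transform(model.transform(X_test))
        fold_scores.append(-mean_squared_error(X_test, X_preimage))
    # Step 3: average the scores over the folds.
    mean_scores.append((np.mean(fold_scores), params))

# Select the combination with the highest mean score.
best_score, best_params = max(mean_scores, key=lambda item: item[0])
print(best_params, best_score)
```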
Now the tricky part is step 2. By default, if you provide a string as the scoring, it will be converted to a make_scorer object internally. For 'mean_squared_error' the relevant code is:
....
neg_mean_squared_error_scorer = make_scorer(mean_squared_error,
greater_is_better=False)
....
That is not what you want, because it requires y_true and y_pred.
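A small check makes the problem concrete. It assumes a recent scikit-learn release, where the built-in scorer is requested with the string "neg_mean_squared_error", and it uses random data that is not part of the question. The built-in scorer obtains y_pred by calling predict on the estimator, and KernelPCA has no such method.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.metrics import get_scorer

# Assumed toy data, only to have a fitted estimator to test with.
rng = np.random.RandomState(0)
X = rng.normal(size=(50, 3))
kpca = KernelPCA(n_components=2, kernel="rbf", fit_inverse_transform=True).fit(X)

# KernelPCA is a transformer: it exposes transform/inverse_transform, not predict.
has_predict = hasattr(kpca, "predict")
print(has_predict)

# The built-in scorer needs predictions to compare against y_true, so it fails here.
scorer = get_scorer("neg_mean_squared_error")
try:
    scorer(kpca, X, X)
    scorer_failed = False
except Exception as exc:
    scorer_failed = True
    print(type(exc).__name__)
```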
The other option is to write your own custom scorer with the signature (estimator, X, y). Something like the following for your case:
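from sklearn.metrics import mean_squared_error

def my_scorer(estimator, X, y=None):
    # Project the held-out data, then map it back to the original space.
    X_reduced = estimator.transform(X)
    X_preimage = estimator.inverse_transform(X_reduced)
    # GridSearchCV maximizes the score, so return the negated error.
    return -1 * mean_squared_error(X, X_preimage)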
Then use it in GridSearchCV like this:
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.model_selection import GridSearchCV

param_grid = [{
    "gamma": np.linspace(0.03, 0.05, 10),
    "kernel": ["rbf", "sigmoid", "linear", "poly"]
}]
kpca = KernelPCA(fit_inverse_transform=True, n_jobs=-1)
grid_search = GridSearchCV(kpca, param_grid, cv=3, scoring=my_scorer)
grid_search.fit(X)
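For reference, here is a self-contained run of the same approach that also reads back the result. The swiss-roll data, n_components=2, and the reduced grid (fewer gamma values, and only the rbf and linear kernels) are assumptions chosen to keep the example small. They are not taken from the question.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV

def my_scorer(estimator, X, y=None):
    # Negated pre-image reconstruction error on the data passed in.
    X_reduced = estimator.transform(X)
    X_preimage = estimator.inverse_transform(X_reduced)
    return -mean_squared_error(X, X_preimage)

# Assumed toy data: a 3D swiss roll reduced to 2 components.
X, _ = make_swiss_roll(n_samples=300, noise=0.1, random_state=0)

param_grid = {
    "gamma": np.linspace(0.03, 0.05, 3),
    "kernel": ["rbf", "linear"],
}
kpca = KernelPCA(n_components=2, fit_inverse_transform=True)
grid_search = GridSearchCV(kpca, param_grid, cv=3, scoring=my_scorer)
grid_search.fit(X)  # no y: the scorer only needs X

# Best combination, and its mean reconstruction error (sign flipped back).
print(grid_search.best_params_)
print(-grid_search.best_score_)
```

After fitting, grid_search.best_estimator_ is a KernelPCA refit on all of X with the selected parameters, and grid_search.cv_results_ holds the per-fold scores for every combination.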