Selecting kernel and hyperparameters for kernel PCA reduction

Question

I'm reading Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems.

I'm trying to optimize an unsupervised kernel PCA algorithm. Here is some context:

Another approach, this time entirely unsupervised, is to select the kernel and hyperparameters that yield the lowest reconstruction error. However, reconstruction is not as easy as with linear PCA

....

Fortunately, it is possible to find a point in the original space that would map close to the reconstructed point. This is called the reconstruction pre-image. Once you have this pre-image, you can measure its squared distance to the original instance. You can then select the kernel and hyperparameters that minimize this reconstruction pre-image error.

One solution is to train a supervised regression model, with the projected instances as the training set and the original instances as the targets.

Now you can use grid search with cross-validation to find the kernel and hyperparameters that minimize this pre-image reconstruction error.

The code provided in the book to perform the reconstruction without cross-validation is:

from sklearn.decomposition import KernelPCA

rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.0433, fit_inverse_transform=True)
X_reduced = rbf_pca.fit_transform(X)
X_preimage = rbf_pca.inverse_transform(X_reduced)

>>> from sklearn.metrics import mean_squared_error
>>> mean_squared_error(X, X_preimage)
32.786308795766132
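For context, fit_inverse_transform=True is what implements the supervised-regression idea from the quote above: scikit-learn learns a kernel ridge regression from the projected instances back to the original instances, and inverse_transform applies it. The snippet below is only a hand-rolled, illustrative sketch of that idea; the toy data and variable names are my own assumptions, so the error will not match the book's value.

import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import mean_squared_error

X = np.random.RandomState(42).rand(100, 3)   # toy stand-in for the book's dataset

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.0433)
X_reduced = kpca.fit_transform(X)

# Projected instances are the training set, original instances are the targets.
inverse_model = KernelRidge(kernel="rbf", gamma=0.0433)
inverse_model.fit(X_reduced, X)
X_preimage = inverse_model.predict(X_reduced)

print(mean_squared_error(X, X_preimage))      # pre-image reconstruction error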

My question is: how do I go about implementing cross-validation to tune the kernel and hyperparameters so as to minimize the pre-image reconstruction error?

Here is my current approach:

from sklearn.metrics import mean_squared_error
from sklearn.decomposition import KernelPCA

mean_squared_error(X, X_preimage)

kpca=KernelPCA(fit_inverse_transform=True, n_jobs=-1) 

from sklearn.model_selection import GridSearchCV

param_grid = [{
        "kpca__gamma": np.linspace(0.03, 0.05, 10),
        "kpca__kernel": ["rbf", "sigmoid", "linear", "poly"]
    }]

grid_search = GridSearchCV(clf, param_grid, cv=3, scoring='mean_squared_error')
X_reduced = kpca.fit_transform(X)
X_preimage = kpca.inverse_transform(X_reduced)
grid_search.fit(X,X_preimage)

Thanks

Answer

GridSearchCV is capable of doing cross-validation for unsupervised learning (without a y), as can be seen in the documentation:

fit(X, y=None, groups=None, **fit_params)

...
y : array-like, shape = [n_samples] or [n_samples, n_output], optional
    Target relative to X for classification or regression;
    None for unsupervised learning
...

So the only thing that needs to be handled is how the scoring will be done.

The following will happen in GridSearchCV:

  1. The data X will be divided into train-test splits based on the folds defined in the cv param.

  2. For each combination of parameters that you specified in param_grid, the model will be trained on the train part from the step above, and then the scorer will be applied to the test part.

  3. The scores for each parameter combination will be combined across all the folds and averaged; the highest-scoring parameter combination will be selected. (A rough sketch of this loop follows below.)
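To make steps 1-3 concrete, here is a rough, simplified sketch of the loop GridSearchCV effectively runs when y is None, using a scorer with the (estimator, X, y) signature. The name manual_grid_search is made up for illustration, and the real GridSearchCV additionally handles parallelism, refitting and result bookkeeping:

import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold, ParameterGrid

def manual_grid_search(estimator, param_grid, X, scorer, n_splits=3):
    # Assumes X is a NumPy array so it can be indexed by fold indices.
    mean_scores = {}
    for params in ParameterGrid(param_grid):
        fold_scores = []
        for train_idx, test_idx in KFold(n_splits=n_splits).split(X):
            model = clone(estimator).set_params(**params)
            model.fit(X[train_idx])                      # no y for unsupervised learning
            fold_scores.append(scorer(model, X[test_idx], None))
        mean_scores[str(params)] = np.mean(fold_scores)  # average over the folds
    return max(mean_scores, key=mean_scores.get)         # highest mean score wins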

Now the tricky part is step 2. By default, if you provide a string as the scoring parameter, it will be converted to a make_scorer object internally. For 'mean_squared_error' the relevant code is:

....
neg_mean_squared_error_scorer = make_scorer(mean_squared_error,
                                        greater_is_better=False)
....

which is not what you want, because it requires y_true and y_pred.

The other option is to write your own custom scorer, as discussed here, with the signature (estimator, X, y). Something like the below for your case:

from sklearn.metrics import mean_squared_error

def my_scorer(estimator, X, y=None):
    # Project X and map it back to the original feature space.
    X_reduced = estimator.transform(X)
    X_preimage = estimator.inverse_transform(X_reduced)
    # Negate the MSE because GridSearchCV selects the highest score.
    return -1 * mean_squared_error(X, X_preimage)

Then use it in GridSearchCV like this:

import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.model_selection import GridSearchCV

param_grid = [{
        "gamma": np.linspace(0.03, 0.05, 10),
        "kernel": ["rbf", "sigmoid", "linear", "poly"]
    }]

kpca = KernelPCA(fit_inverse_transform=True, n_jobs=-1)
grid_search = GridSearchCV(kpca, param_grid, cv=3, scoring=my_scorer)
grid_search.fit(X)
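After fitting, the winning kernel/gamma combination can be read off the search object. Remember that best_score_ is the negated MSE because of the custom scorer, and best_estimator_ is by default refit on the full X (mean_squared_error is already imported above):

print(grid_search.best_params_)
print(-grid_search.best_score_)           # mean pre-image reconstruction MSE across folds

best_kpca = grid_search.best_estimator_
X_preimage = best_kpca.inverse_transform(best_kpca.transform(X))
print(mean_squared_error(X, X_preimage))  # error of the refit model on the full data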
