Sci-kit Learn PLS SVD and cross validation


Problem description

The sklearn.cross_decomposition.PLSSVD class in Sci-kit learn appears to be failing when the response variable has a shape of (N,) instead of (N,1), where N is the number of samples in the dataset.

However, sklearn.cross_validation.cross_val_score fails when the response variable has a shape of (N,1) instead of (N,). How can I use them together?

A piece of code:

from sklearn import cross_validation  # needed for cross_val_score below
from sklearn.pipeline import Pipeline
from sklearn.cross_decomposition import PLSSVD
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# x -> (N, 60) numpy array
# y -> (N, ) numpy array

# These are the classifier 'pieces' I'm using
plssvd = PLSSVD(n_components=5, scale=False)
logistic = LogisticRegression(penalty='l2', C=0.5)
scaler = StandardScaler(with_mean=True, with_std=True)

# Here's the pipeline that's failing
plsclf = Pipeline([('scaler', scaler),
                   ('plssvd', plssvd), 
                   ('logistic', logistic)])

# Just to show how I'm using the pipeline for a working classifier
logclf = Pipeline([('scaler', scaler),
                   ('logistic', logistic)])

##################################################################

# This works fine
log_scores = cross_validation.cross_val_score(logclf, x, y, scoring='accuracy',
                                              verbose=True, cv=5, n_jobs=4)

# This fails!
pls_scores = cross_validation.cross_val_score(plsclf, x, y, scoring='accuracy',
                                              verbose=True, cv=5, n_jobs=4)

Specifically, it fails in the _center_scale_xy function of cross_decomposition/pls_.pyc with 'IndexError: tuple index out of range' at line 103: y_std = np.ones(Y.shape[1]), because the shape tuple has only one element.

If I set scale=True in the PLSSVD constructor, it fails in the same function at line 99: y_std[y_std == 0.0] = 1.0, because it is attempting to do a boolean index on a float (y_std is a float, since it only has one dimension).
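Both failures come down to NumPy shape semantics rather than anything PLS-specific. A minimal sketch with plain NumPy (the array here is illustrative) reproduces the same two problems that the question reports at lines 103 and 99 of pls_.py:

```python
import numpy as np

y = np.array([0.0, 1.0, 0.0, 1.0])   # shape (4,), like the (N,) response

# Mirrors line 103: Y.shape[1] does not exist for a 1-D array
try:
    np.ones(y.shape[1])
except IndexError as e:
    print("IndexError:", e)           # tuple index out of range

# With a 1-D Y, the per-column std collapses to a 0-d scalar instead of
# a length-1 array, which is why the boolean indexing at line 99 breaks
y_std = y.std(axis=0, ddof=1)
print(np.ndim(y_std))                 # 0, not 1
```

A (N, 1) array avoids both issues, because `Y.shape[1]` exists and `Y.std(axis=0)` stays a length-1 array.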

Seems like an easy fix: just make sure the y variable has two dimensions, (N, 1). However:

If I create an array with dimensions (N,1) out of the output variable y, it still fails. In order to change the arrays, I add this before running cross_val_score:

y = np.transpose(np.array([y]))

Then, it fails in sklearn/cross_validation.py at line 398:

File "my_secret_script.py", line 293, in model_create
    scores = cross_validation.cross_val_score(plsclf, x, y, scoring='accuracy', verbose=True, cv=5, n_jobs=4)
File "/Users/my.secret.name/anaconda/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1129, in cross_val_score
    cv = _check_cv(cv, X, y, classifier=is_classifier(estimator))
File "/Users/my.secret.name/anaconda/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1216, in _check_cv
    cv = StratifiedKFold(y, cv, indices=needs_indices)
File "/Users/my.secret.name/anaconda/lib/python2.7/site-packages/sklearn/cross_validation.py", line 398, in __init__
    label_test_folds = test_folds[y == label]
ValueError: boolean index array should have 1 dimension
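As the traceback shows, StratifiedKFold boolean-indexes y (`y == label`) and therefore wants it one-dimensional. For reference, the transpose trick above is just a reshape, and ravel() undoes it; a quick NumPy check (variable names are illustrative):

```python
import numpy as np

y = np.array([0, 1, 0, 1, 1])          # shape (N,)

y_col = np.transpose(np.array([y]))    # the trick from above -> shape (N, 1)

# Equivalent to (and more idiomatic than) the transpose trick
assert y_col.shape == (5, 1)
assert np.array_equal(y_col, y.reshape(-1, 1))

# ravel() recovers the 1-D shape that StratifiedKFold's
# boolean indexing (y == label) can handle
assert np.array_equal(y_col.ravel(), y)
assert (y == 1).ndim == 1
```

So the two APIs genuinely want y in incompatible shapes: PLSSVD wants (N, 1), StratifiedKFold wants (N,).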

I'm running this on OSX, NumPy version 1.8.0, Sci-kit Learn version 0.15-git.

Any way to use PLSSVD together with cross_val_score?

Recommended answer

Partial least squares projects both your data X and your target Y onto linear subspaces spanned by n_components vectors each. They are projected in a way that regression scores of one projected variable on the other are maximized.

The number of components, i.e. the dimension of the latent subspaces, is bounded by the number of features in your variables. Your variable Y has only one feature (one column), so the latent subspace is one-dimensional, effectively reducing your construction to something more akin to (but not exactly the same as) linear regression. So using partial least squares in this specific situation is probably not useful.

Take a look at the following:

import numpy as np
rng = np.random.RandomState(42)
n_samples, n_features_x, n_features_y, n_components = 20, 10, 1, 1
X = rng.randn(n_samples, n_features_x)
y = rng.randn(n_samples, n_features_y)

from sklearn.cross_decomposition import PLSSVD
plssvd = PLSSVD(n_components=n_components)

X_transformed, Y_transformed = plssvd.fit_transform(X, y)

X_transformed and Y_transformed are arrays of shape (n_samples, n_components); they are the projected versions of X and Y.

The answer to your question about using PLSSVD within a Pipeline in cross_val_score is: no, it will not work out of the box, because the Pipeline object calls fit and transform using both variables X and Y as arguments where possible, which, as you can see in the code I wrote, returns a tuple containing the projected X and Y values. The next step in the pipeline will not be able to process this, because it will think that this tuple is the new X.

This type of failure is due to the fact that sklearn is only beginning to be systematic about multiple-target support. The PLSSVD estimator you are trying to use is inherently multi-target, even if you are only using it on one target.

Solution: don't use partial least squares on 1-D targets; there would be no gain to it even if it worked with the pipeline.
