Sci-kit Learn PLS SVD and cross validation
Problem description
The sklearn.cross_decomposition.PLSSVD class in Sci-kit Learn appears to fail when the response variable has a shape of (N,) instead of (N, 1), where N is the number of samples in the dataset.
However, sklearn.cross_validation.cross_val_score fails when the response variable has a shape of (N, 1) instead of (N,). How can I use them together?
A piece of code:
from sklearn.pipeline import Pipeline
from sklearn.cross_decomposition import PLSSVD
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn import cross_validation
# x -> (N, 60) numpy array
# y -> (N, ) numpy array
# These are the classifier 'pieces' I'm using
plssvd = PLSSVD(n_components=5, scale=False)
logistic = LogisticRegression(penalty='l2', C=0.5)
scaler = StandardScaler(with_mean=True, with_std=True)
# Here's the pipeline that's failing
plsclf = Pipeline([('scaler', scaler),
                   ('plssvd', plssvd),
                   ('logistic', logistic)])
# Just to show how I'm using the pipeline for a working classifier
logclf = Pipeline([('scaler', scaler),
                   ('logistic', logistic)])
##################################################################
# This works fine
log_scores = cross_validation.cross_val_score(logclf, x, y, scoring='accuracy',
                                              verbose=True, cv=5, n_jobs=4)
# This fails!
pls_scores = cross_validation.cross_val_score(plsclf, x, y, scoring='accuracy',
                                              verbose=True, cv=5, n_jobs=4)
Specifically, it fails in the _center_scale_xy function of cross_decomposition/pls_.pyc with 'IndexError: tuple index out of range' at line 103: y_std = np.ones(Y.shape[1]), because the shape tuple has only one element.
If I set scale=True in the PLSSVD constructor, it fails in the same function at line 99: y_std[y_std == 0.0] = 1.0, because it is attempting boolean indexing on a float (y_std is a float, since it has only one dimension).
This seems like an easy fix: just make sure the y variable has two dimensions, (N, 1). However:
If I create an array with dimensions (N, 1) out of the output variable y, it still fails. To change the array, I add this before running cross_val_score:
# Reshape y from (N,) to (N, 1); equivalent to y.reshape(-1, 1)
y = np.transpose(np.array([y]))
Then, it fails in sklearn/cross_validation.py at line 398:
File "my_secret_script.py", line 293, in model_create
scores = cross_validation.cross_val_score(plsclf, x, y, scoring='accuracy', verbose=True, cv=5, n_jobs=4)
File "/Users/my.secret.name/anaconda/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1129, in cross_val_score
cv = _check_cv(cv, X, y, classifier=is_classifier(estimator))
File "/Users/my.secret.name/anaconda/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1216, in _check_cv
cv = StratifiedKFold(y, cv, indices=needs_indices)
File "/Users/my.secret.name/anaconda/lib/python2.7/site-packages/sklearn/cross_validation.py", line 398, in __init__
label_test_folds = test_folds[y == label]
ValueError: boolean index array should have 1 dimension
I'm running this on OS X, NumPy version 1.8.0, Sci-kit Learn version 0.15-git.
Is there any way to use PLSSVD together with cross_val_score?
Recommended answer
Partial least squares projects both your data X and your target Y onto linear subspaces spanned by n_components vectors each. They are projected in a way that maximizes the regression scores of one projected variable on the other.
The number of components, i.e. the dimension of the latent subspaces, is bounded by the number of features in your variables. Your variable Y has only one feature (one column), so the latent subspace is one-dimensional, effectively reducing your construction to something more akin to (but not exactly the same as) linear regression. So using partial least squares in this specific situation is probably not useful.
Take a look at the following:
import numpy as np
rng = np.random.RandomState(42)
n_samples, n_features_x, n_features_y, n_components = 20, 10, 1, 1
X = rng.randn(n_samples, n_features_x)
y = rng.randn(n_samples, n_features_y)
from sklearn.cross_decomposition import PLSSVD
plssvd = PLSSVD(n_components=n_components)
X_transformed, Y_transformed = plssvd.fit_transform(X, y)
X_transformed and Y_transformed are arrays of shape (n_samples, n_components); they are the projected versions of X and Y.
The answer to your question about using PLSSVD within a Pipeline in cross_val_score is no: it will not work out of the box, because the Pipeline object calls fit and transform using both variables X and Y as arguments where possible, which, as you can see in the code I wrote, returns a tuple containing the projected X and Y values. The next step in the pipeline will not be able to process this, because it will think that this tuple is the new X.
This type of failure is due to the fact that sklearn is only beginning to be systematic about multiple-target support. The PLSSVD estimator you are trying to use is inherently multi-target, even if you are only using it on one target.
Solution: Don't use partial least squares on 1D targets; there would be no gain to it even if it worked with the pipeline.
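Following that advice, if a dimensionality-reduction step is still wanted in the pipeline, any transformer whose transform returns a single array composes cleanly with cross-validation. Below is a sketch with PCA substituted for PLSSVD (my substitution, not mentioned in the original answer), using the modern sklearn.model_selection API and data shaped like the question's:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
x = rng.randn(100, 60)          # mimics the question's (N, 60) data
y = (x[:, 0] > 0).astype(int)   # toy binary target, shape (N,)

# PCA.transform returns one array, so the pipeline composes cleanly
pipe = Pipeline([('scaler', StandardScaler()),
                 ('pca', PCA(n_components=5)),
                 ('logistic', LogisticRegression(penalty='l2', C=0.5))])
scores = cross_val_score(pipe, x, y, scoring='accuracy', cv=5)
print(len(scores))  # 5
```

Unlike PLS, PCA ignores y when projecting, but for a 1D target the supervised projection was nearly trivial anyway, as shown above.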