GridSearchCV on a pipeline with StandardScaler, PCA & Lasso
Assume that I am doing GridSearchCV on a pipeline with [StandardScaler, PCA & Lasso], where the grid search is over 2 values for a PCA parameter and 3 values for a Lasso parameter (thus 6 possible parameter combinations). When doing CV, for a given fold does the algorithm standardize only the train set in that fold (i.e., not include the fold's test set for determining mean/variance of the standardizer) or does it standardize the entire data set outside of the folds (in which case there is only one Standardizing done for the entire grid search procedure)?
If you are using a sklearn.pipeline.Pipeline object containing a sklearn.preprocessing.StandardScaler, a sklearn.decomposition.PCA and a sklearn.linear_model.Lasso, and you wrap this pipeline in GridSearchCV to get a cross-validated estimator, then the StandardScaler estimates the parameters for centering and rescaling to unit variance only on the training part of each fold.
When the pipeline is evaluated on the test fold, the StandardScaler reuses the stored training means and standard deviations: it subtracts the train mean from the test set and divides the result by the train standard deviation.
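The transformation applied to the test fold can be checked by hand. The following is a minimal sketch on synthetic data; the array names and sizes are made up for illustration and are not part of the original question.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the train and test parts of one CV fold (assumed sizes).
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(80, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# Fit on the training part only; mean_ and scale_ are stored on the scaler.
scaler = StandardScaler().fit(X_train)

# Transforming the test part reuses the stored training statistics.
X_test_scaled = scaler.transform(X_test)

# Manual equivalent: subtract the train mean, divide by the train std (ddof=0).
manual = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)
print(np.allclose(X_test_scaled, manual))
```

Because the statistics come from the training part, the scaled test columns will generally not have exactly zero mean or unit variance, which is the expected behaviour.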
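So the answer is: no, the StandardScaler will not use the test set in any way to determine the mean and variance of the data. Nor is the data standardized once up front: the whole pipeline, scaler included, is refit on the training part of every fold for every parameter combination.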
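One way to see this for yourself is to log how many rows the scaler is fit on during the search. The sketch below rests on several assumptions: CountingScaler is a hypothetical helper subclass written for this illustration, the data is synthetic, and the grid values are arbitrary. It relies on the default single-process execution, since a class-level log would not be shared across worker processes.

```python
from collections import Counter

from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class CountingScaler(StandardScaler):
    """Hypothetical helper: a StandardScaler that logs the row count of every fit."""

    fit_sizes = []  # class attribute, so the log survives GridSearchCV's cloning

    def fit(self, X, y=None, sample_weight=None):
        CountingScaler.fit_sizes.append(X.shape[0])
        return super().fit(X, y, sample_weight=sample_weight)


# Synthetic regression data: 100 rows, so each of 5 folds trains on 80 rows.
X, y = make_regression(n_samples=100, n_features=10, noise=1.0, random_state=0)

pipe = Pipeline([("scaler", CountingScaler()), ("pca", PCA()), ("lasso", Lasso())])

# 2 PCA values x 3 Lasso values = 6 parameter combinations, as in the question.
param_grid = {"pca__n_components": [2, 5], "lasso__alpha": [0.01, 0.1, 1.0]}

search = GridSearchCV(pipe, param_grid, cv=KFold(n_splits=5))
search.fit(X, y)

# Map from "rows seen by fit" to "number of fits with that many rows".
fit_counts = Counter(CountingScaler.fit_sizes)
print(fit_counts)
```

With this setup the scaler should be fit 30 times on 80 rows (5 folds times 6 combinations) and once on all 100 rows, the latter coming from the final refit of the best pipeline on the full data (refit=True is the default). It is never fit on a 20-row test fold.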