How to use scikit's preprocessing/normalization along with cross validation?
As an example of cross-validation without any preprocessing, I can do something like this:
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

tuned_params = [{"penalty": ["l2", "l1"]}]
SGD = SGDClassifier()
clf = GridSearchCV(SGD, tuned_params, verbose=5)
clf.fit(x_train, y_train)
I would like to preprocess my data using something like
from sklearn import preprocessing
x_scaled = preprocessing.scale(x_train)
But it would not be a good idea to do this before setting up the cross-validation, because then the training and testing sets would be normalized together. How do I set up the cross-validation so that it preprocesses the corresponding training and test sets separately on each run?
Per the documentation, this can be done for you if you employ a Pipeline. From the docs, just above section 3.1.1.1, emphasis mine:
Just as it is important to test a predictor on data held-out from training, preprocessing (such as standardization, feature selection, etc.) and similar data transformations similarly should be learnt from a training set and applied to held-out data for prediction [...] A Pipeline makes it easier to compose estimators, providing this behavior under cross-validation[.]
More relevant information on pipelines is available here.
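A minimal sketch of the Pipeline approach, assuming toy stand-in data for x_train/y_train: the scaler is fit only on each fold's training split and then applied to that fold's held-out split, so no information leaks between folds. Note that parameters of pipeline steps are addressed in the grid as <step name>__<param name>, and that this sketch uses the modern sklearn.model_selection module path rather than the old sklearn.grid_search.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical toy data standing in for x_train / y_train.
rng = np.random.RandomState(0)
x_train = rng.randn(100, 5)
y_train = (x_train[:, 0] > 0).astype(int)

pipe = Pipeline([
    ("scaler", StandardScaler()),  # learnt from each training fold only
    ("sgd", SGDClassifier(max_iter=1000, tol=1e-3)),
])

# Step parameters are addressed as "<step name>__<param name>".
tuned_params = {"sgd__penalty": ["l2", "l1"]}

clf = GridSearchCV(pipe, tuned_params, cv=5)
clf.fit(x_train, y_train)
print(clf.best_params_)
```

Because the scaler lives inside the pipeline, GridSearchCV refits it from scratch on every fold, which is exactly the per-run train/test separation the question asks for.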