How to standardize data with sklearn's cross_val_score()


Question

Let's say I want to use a LinearSVC to perform k-fold cross-validation on a dataset. How would I standardize the data?

The best practice I have read is to fit the standardization model on the training data and then apply that fitted model to the testing data.

With a simple train_test_split(), this is easy, as we can just do:

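from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# X, y: feature matrix and labels, assumed to be loaded already
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

clf = svm.LinearSVC()

# Fit the scaler on the training split only, then reuse it on the test split
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

clf.fit(X_train, y_train)
predicted = clf.predict(X_test)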

How would one go about standardizing data while doing k-fold cross-validation? The problem is that every data point is used for both training and testing at some point, so you cannot standardize everything before calling cross_val_score(). Wouldn't you need a different standardization for each cross-validation fold?

The docs do not mention standardization happening internally within the function. Am I SOL?

Edit: This post is super helpful: Python - What is exactly sklearn.pipeline.Pipeline?

Recommended answer

You can use a Pipeline (http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) to combine both steps and then pass it to cross_val_score().

When fit() is called on the pipeline, it fits each transform in turn and transforms the data, then fits the final estimator on the transformed data. During predict() (only available if the last object in the pipeline is an estimator, otherwise use transform()), it applies the already fitted transforms to the data and predicts with the final estimator. Because cross_val_score() fits a fresh clone of the pipeline on the training part of each fold, the scaler is fit on that fold's training data only and then applied to the held-out part, which gives you a separate standardization per fold.

Like this:

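from sklearn import svm
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
clf = svm.LinearSVC()

# The scaler is refit on the training part of every fold
pipeline = Pipeline([('transformer', scaler), ('estimator', clf)])

cv = KFold(n_splits=4)
scores = cross_val_score(pipeline, X, y, cv=cv)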
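
To see what the pipeline does for you, the sketch below compares it with a hand-written loop that fits a StandardScaler on each training fold and applies it to the matching held-out fold. It is only an illustration under stated assumptions: the data is synthetic (make_classification stands in for the real X and y), make_clf is a hypothetical helper name, and random_state and max_iter are set only to keep the run repeatable.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Assumption: synthetic data in place of the real X and y.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

cv = KFold(n_splits=4)


def make_clf():
    # Hypothetical helper: builds identically configured classifiers.
    return LinearSVC(random_state=0, max_iter=10000)


# Pipeline route: cross_val_score refits scaler + classifier on every fold.
pipeline = Pipeline([('transformer', StandardScaler()), ('estimator', make_clf())])
pipeline_scores = cross_val_score(pipeline, X, y, cv=cv)

# Manual route: the same steps written out fold by fold.
manual_scores = []
for train_idx, test_idx in cv.split(X):
    # Fit the scaler on the training fold only.
    scaler = StandardScaler().fit(X[train_idx])
    clf = make_clf().fit(scaler.transform(X[train_idx]), y[train_idx])
    # Reuse the training-fold statistics on the held-out fold.
    manual_scores.append(clf.score(scaler.transform(X[test_idx]), y[test_idx]))
manual_scores = np.array(manual_scores)

print(pipeline_scores)
print(manual_scores)
```

The two score arrays should agree, which is the point: the pipeline saves you from writing the per-fold loop and from accidentally fitting the scaler on held-out data.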

Check out the various Pipeline examples to understand it better:

  • http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#examples-using-sklearn-pipeline-pipeline

Feel free to ask if you have any doubts.
