How to apply standardization to SVMs in scikit-learn?


Question

I'm using the current stable version 0.13 of scikit-learn. I'm applying a linear support vector classifier to some data using the class sklearn.svm.LinearSVC.

In the chapter about preprocessing in scikit-learn's documentation, I've read the following:

Many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the l1 and l2 regularizers of linear models) assume that all features are centered around zero and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.
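To illustrate the point in that quote, here is a small sketch (with hypothetical toy data, not from the question) showing how StandardScaler brings features on wildly different scales to zero mean and unit variance:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: the second feature is orders of magnitude larger than the first
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 5000.0]])

X_scaled = StandardScaler().fit_transform(X)

print(X_scaled.mean(axis=0))  # approximately [0. 0.]
print(X_scaled.std(axis=0))   # approximately [1. 1.]
```

After scaling, neither feature can dominate the objective function purely because of its numeric range.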

Question 1: Is standardization useful for SVMs in general, also for those with a linear kernel function as in my case?

Question 2: As far as I understand, I have to compute the mean and standard deviation on the training data and apply this same transformation on the test data using the class sklearn.preprocessing.StandardScaler. However, what I don't understand is whether I have to transform the training data as well or just the test data prior to feeding it to the SVM classifier.

That is, do I have to do this:

scaler = StandardScaler()
scaler.fit(X_train)                # only compute mean and std here
X_test = scaler.transform(X_test)  # perform standardization by centering and scaling

clf = LinearSVC()
clf.fit(X_train, y_train)
clf.predict(X_test)

Or do I have to do this:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # compute mean, std and transform training data as well
X_test = scaler.transform(X_test)  # same as above

clf = LinearSVC()
clf.fit(X_train, y_train)
clf.predict(X_test)

In short, do I have to use scaler.fit(X_train) or scaler.fit_transform(X_train) on the training data in order to get reasonable results with LinearSVC?

Answer

Neither.

scaler.transform(X_train) doesn't have any effect. The transform operation is not in-place. You have to do

X_train = scaler.fit_transform(X_train)

X_test = scaler.transform(X_test)

or

X_train = scaler.fit(X_train).transform(X_train)

You always need to do the same preprocessing on both training and test data. And yes, standardization is always good if it reflects your belief about the data. In particular, for kernel SVMs it is often crucial.
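A way to avoid the fit/transform bookkeeping entirely (a sketch of an alternative, not part of the original answer) is to chain the scaler and the classifier with sklearn.pipeline.Pipeline: calling fit() on the pipeline fits and transforms with the scaler before fitting the SVM, and predict() applies transform (never fit_transform) to new data. The toy data below is hypothetical:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Hypothetical data standing in for X_train/X_test from the question,
# with features on very different scales
rng = np.random.RandomState(0)
X_train = rng.randn(40, 3) * [1.0, 100.0, 10000.0]
y_train = (X_train[:, 0] > 0).astype(int)
X_test = rng.randn(10, 3) * [1.0, 100.0, 10000.0]

# The pipeline guarantees the scaler's statistics come from the
# training data only and are reused unchanged on the test data
clf = Pipeline([("scaler", StandardScaler()), ("svc", LinearSVC())])
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print(pred.shape)  # (10,)
```

This makes it impossible to accidentally fit the scaler on the test set, and it also plays well with cross-validation utilities.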
