How to apply standardization to SVMs in scikit-learn?


Problem description

I'm using the current stable version 0.13 of scikit-learn. I'm applying a linear support vector classifier to some data using the class sklearn.svm.LinearSVC.

In the chapter about preprocessing in scikit-learn's documentation, I've read the following:


Many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the l1 and l2 regularizers of linear models) assume that all features are centered around zero and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.
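
To make the quoted passage concrete, here is a minimal sketch (the feature values are made up purely for illustration) of how StandardScaler rescales each feature to zero mean and unit variance:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (illustrative values only)
X = np.array([[1.0, 1000.0],
              [2.0, 5000.0],
              [3.0, 3000.0]])

X_scaled = StandardScaler().fit_transform(X)

print(X_scaled.mean(axis=0))  # close to 0 for every feature after scaling
print(X_scaled.std(axis=0))   # close to 1 for every feature after scaling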

Question 1: Is standardization useful for SVMs in general, also for those with a linear kernel function as in my case?

Question 2: As far as I understand, I have to compute the mean and standard deviation on the training data and apply this same transformation on the test data using the class sklearn.preprocessing.StandardScaler. However, what I don't understand is whether I have to transform the training data as well or just the test data prior to feeding it to the SVM classifier.

That is, do I have to do this:

scaler = StandardScaler()
scaler.fit(X_train)                # only compute mean and std here
X_test = scaler.transform(X_test)  # perform standardization by centering and scaling

clf = LinearSVC()
clf.fit(X_train, y_train)
clf.predict(X_test)

Or do I have to do this:

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # compute mean, std and transform training data as well
X_test = scaler.transform(X_test)  # same as above

clf = LinearSVC()
clf.fit(X_train, y_train)
clf.predict(X_test)

In short, do I have to use scaler.fit(X_train) or scaler.fit_transform(X_train) on the training data in order to get reasonable results with LinearSVC?

Answer

Neither.

scaler.transform(X_train) doesn't have any effect. The transform operation is not in-place. You have to do

X_train = scaler.fit_transform(X_train)

X_test = scaler.transform(X_test)

or, equivalently,

X_train = scaler.fit(X_train).transform(X_train)

You always need to apply the same preprocessing to both the training and the test data. And yes, standardization is always good if it reflects your beliefs about the data. In particular, for kernel SVMs it is often crucial.
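
A convenient way to guarantee that exactly the same preprocessing fitted on the training data is reused on the test data is to chain the scaler and the classifier in a scikit-learn Pipeline. This is not part of the original answer, just a minimal sketch reusing X_train, y_train and X_test from the question:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

clf = Pipeline([('scaler', StandardScaler()),
                ('svc', LinearSVC())])

clf.fit(X_train, y_train)   # fits the scaler on X_train and trains the SVM on the scaled data
clf.predict(X_test)         # applies the already-fitted scaler to X_test before predicting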

