Speed of SVM Kernels? Linear vs RBF vs Poly


Question

I'm using scikitlearn in Python to create some SVM models while trying different kernels. The code is pretty simple, and follows the form of:

from time import time

from sklearn import svm

# Each call below builds a fresh classifier; in practice each kernel
# is fitted and timed one at a time.
clf = svm.SVC(kernel='rbf', C=1, gamma=0.1)
clf = svm.SVC(kernel='linear', C=1, gamma=0.1)
clf = svm.SVC(kernel='poly', C=1, gamma=0.1)
t0 = time()
clf.fit(X_train, y_train)
print("Training time:", round(time() - t0, 3), "s")
pred = clf.predict(X_test)

The data has 8 features and a little over 3,000 observations. I was surprised to see that rbf was fitted in under a second, whereas linear took 90 seconds and poly took hours.

I assumed that the non-linear kernels would be more complicated and take more time. Is there a reason the linear is taking so much longer than rbf, and that poly is taking so much longer than both? Can it vary dramatically based on my data?

Answer

Did you scale your data?

This can become an issue with SVMs. According to A Practical Guide to Support Vector Classification:

Because kernel values usually depend on the inner products of feature vectors, e.g. the linear kernel and the polynomial kernel, large attribute values might cause numerical problems.
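To see what the guide means, here is a small illustration (my addition, not part of the original answer) with hypothetical feature vectors. With an unscaled attribute in the thousands, the linear kernel's inner products blow up; after per-attribute scaling to [0, 1], kernel values stay in a numerically friendly range:

```python
import numpy as np

# Two hypothetical feature vectors: one attribute in the thousands
# (an unscaled measurement), one attribute below 1.
x = np.array([3000.0, 0.5])
z = np.array([2800.0, 0.7])

# Linear kernel: K(x, z) = <x, z>. The large attribute dominates.
k_linear = x @ z
print(k_linear)  # ~8.4 million

# The same two vectors after min-max scaling each attribute to [0, 1]
# (done by hand here just for illustration).
x_s = np.array([1.0, 0.0])
z_s = np.array([0.0, 1.0])
k_linear_scaled = x_s @ z_s
print(k_linear_scaled)  # 0.0 -- a kernel value of ordinary magnitude
```

The same effect is even more dramatic for the polynomial kernel, which raises such inner products to a power.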

Now for an example, I will use the sklearn breast cancer dataset:

from time import time

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

data = load_breast_cancer()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y)

clf_lin = SVC(kernel='linear', C=1.0, gamma=0.1)
clf_rbf = SVC(kernel='rbf', C=1.0, gamma=0.1)

start = time()
clf_lin.fit(X_train, y_train)
print("Linear Kernel Non-Normalized Fit Time: {:.4f} s".format(time() - start))
start = time()
clf_rbf.fit(X_train, y_train)
print("RBF Kernel Non-Normalized Fit Time: {:.4f} s".format(time() - start))

scaler = MinMaxScaler()  # Default behavior is to scale to [0,1]
X = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y)

start = time()
clf_lin.fit(X_train, y_train)
print("Linear Kernel Normalized Fit Time: {:.4f} s".format(time() - start))
start = time()
clf_rbf.fit(X_train, y_train)
print("RBF Kernel Normalized Fit Time: {:.4f} s".format(time() - start))

Output:

Linear Kernel Non-Normalized Fit Time: 0.8672
RBF Kernel Non-Normalized Fit Time: 0.0124
Linear Kernel Normalized Fit Time: 0.0021
RBF Kernel Normalized Fit Time: 0.0039

So you can see that in this dataset, of shape (569, 30), we get a pretty drastic improvement in performance from a little scaling.

This behavior depends on the features with large values. Think about working in an infinite-dimensional space, which is effectively what the RBF kernel maps into. As the values you populate that space with grow larger, the distances between the mapped points grow much larger too. I cannot stress that enough. Read about the Curse of Dimensionality, and read more than just the wiki entry I linked. This spacing is what makes the process take longer: the mathematics behind trying to separate the classes in this massive space gets drastically more complex, especially as the number of features and observations grows. Thus it is critical to always scale your data. Even if you are just doing a simple linear regression, it is good practice, as you will remove any possible bias towards features with larger values.
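One caveat about the snippet above: it scales X before splitting, so the test rows influence the scaler's min/max. A common way to avoid that is to put the scaler and the SVM in a pipeline, so the scaler is fitted on the training data only. This is my addition rather than part of the original answer; a minimal sketch:

```python
from time import time

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# MinMaxScaler is fitted inside the pipeline on X_train only, then the
# same learned transform is applied to X_test at predict time.
model = make_pipeline(MinMaxScaler(), SVC(kernel='linear', C=1.0))

start = time()
model.fit(X_train, y_train)
print("Linear Kernel Pipeline Fit Time: {:.4f} s".format(time() - start))
print("Test accuracy: {:.3f}".format(model.score(X_test, y_test)))
```

The timing benefit is the same as in the answer's version, and the accuracy estimate is no longer optimistic from scaling leakage.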
