Making SVM run faster in Python


Problem description

Using the code below for SVM in Python:

from sklearn import datasets
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

iris = datasets.load_iris()
X, y = iris.data, iris.target

# class_weight='auto' was deprecated and later removed in scikit-learn;
# 'balanced' is its replacement.
clf = OneVsRestClassifier(SVC(kernel='linear', probability=True, class_weight='balanced'))
clf.fit(X, y)
proba = clf.predict_proba(X)

But this takes a huge amount of time.

Actual data dimensions:

train-set (1422392,29)
test-set (233081,29)

How can I speed it up (in parallel or some other way)? Please help. I have already tried PCA and downsampling.

I have 6 classes. I found http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html, but I want probability estimates, and it does not seem to provide them for an SVM (hinge loss).

from sklearn import datasets
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.model_selection import GridSearchCV   # replaces the deprecated sklearn.grid_search module
import multiprocessing
import numpy as np

def sigmoid(a):                                     # converts array(x) elements to (1/(1 + e(-x)))
    return 1 / (1 + np.exp(-a))                     # np.exp is already vectorized, no np.vectorize needed

if __name__ == '__main__':
    iris = datasets.load_iris()
    cores = multiprocessing.cpu_count() - 2
    X, y = iris.data, iris.target                   # loading dataset

    C_range = 10.0 ** np.arange(-4, 4)              # C value range
    param_grid = dict(estimator__C=C_range.tolist())

    svr = OneVsRestClassifier(LinearSVC(class_weight='balanced'), n_jobs=cores)  # LinearSVC: faster
    # svr = OneVsRestClassifier(SVC(kernel='linear', probability=True,           # SVC: slow
    #                               class_weight='balanced'), n_jobs=cores)

    clf = GridSearchCV(svr, param_grid, n_jobs=cores, verbose=2)  # grid search over C
    clf.fit(X, y)                                   # training the SVM model

    decisions = clf.decision_function(X)            # outputs decision-function scores
    # prob = clf.predict_proba(X)                   # only SVC (probability=True) outputs probabilities
    print(decisions[:5, :])
    prob = sigmoid(decisions)                       # converts decisions to (1/(1 + e(-x)))
    print(prob[:5, :])
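As an aside on the SGDClassifier link above: it can produce probability estimates after all, either by training with logistic loss instead of hinge loss, or by calibrating a hinge-loss model with CalibratedClassifierCV. A minimal sketch, with illustrative hyperparameters (note the loss is named 'log' rather than 'log_loss' in older scikit-learn versions):

from sklearn import datasets
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import SGDClassifier

X, y = datasets.load_iris(return_X_y=True)

# Option 1: logistic loss exposes predict_proba directly.
clf = SGDClassifier(loss='log_loss', class_weight='balanced').fit(X, y)
proba = clf.predict_proba(X)

# Option 2: keep the hinge loss (a linear SVM) and calibrate its
# decision-function scores into probabilities via cross-validation.
svm = SGDClassifier(loss='hinge', class_weight='balanced')
calibrated = CalibratedClassifierCV(svm).fit(X, y)
proba = calibrated.predict_proba(X)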

Edit 2: The answer by user3914041 yields very poor probability estimates.

Answer

If you want to stick with SVC as much as possible and train on the full dataset, you can use ensembles of SVCs that are trained on subsets of the data to reduce the number of records per classifier (which apparently has quadratic influence on complexity). Scikit supports that with the BaggingClassifier wrapper. That should give you similar (if not better) accuracy compared to a single classifier, with much less training time. The training of the individual classifiers can also be set to run in parallel using the n_jobs parameter.
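A rough back-of-the-envelope argument for why this helps: if training one SVC on n records costs on the order of n^2, then k estimators trained on n/k records each cost about k * (n/k)^2 = n^2 / k in total, i.e. roughly a k-fold speedup even before the estimators are trained in parallel.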

Alternatively, I would also consider using a Random Forest classifier - it supports multi-class classification natively, it is fast and gives pretty good probability estimates when min_samples_leaf is set appropriately.

I did a quick test on the iris dataset, blown up 100 times, with an ensemble of 10 SVCs, each one trained on 10% of the data. It is more than 10 times faster than a single classifier. These are the numbers I got on my laptop:

Single SVC: 45s

Bagging SVC: 3s

Random Forest Classifier: 0.5s

See below for the code I used to produce the numbers:

import time
import numpy as np
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn import datasets
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Blow the dataset up 100x to make the timing differences visible.
X = np.repeat(X, 100, axis=0)
y = np.repeat(y, 100, axis=0)

start = time.time()
clf = OneVsRestClassifier(SVC(kernel='linear', probability=True, class_weight='balanced'))
clf.fit(X, y)
end = time.time()
print("Single SVC", end - start, clf.score(X, y))
proba = clf.predict_proba(X)

# Ensemble of 10 SVCs, each trained on a 10% subset of the data.
n_estimators = 10
start = time.time()
clf = OneVsRestClassifier(BaggingClassifier(SVC(kernel='linear', probability=True, class_weight='balanced'),
                                            max_samples=1.0 / n_estimators, n_estimators=n_estimators))
clf.fit(X, y)
end = time.time()
print("Bagging SVC", end - start, clf.score(X, y))
proba = clf.predict_proba(X)

start = time.time()
clf = RandomForestClassifier(min_samples_leaf=20)
clf.fit(X, y)
end = time.time()
print("Random Forest", end - start, clf.score(X, y))
proba = clf.predict_proba(X)

If you want to make sure that each record is used only once for training in the BaggingClassifier, you can set the bootstrap parameter to False.
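For example, a minimal sketch (parameter values are illustrative) combining bootstrap=False with parallel training via n_jobs:

import numpy as np
from sklearn import datasets
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)
X, y = np.repeat(X, 100, axis=0), np.repeat(y, 100, axis=0)  # same 100x blow-up as above

# bootstrap=False draws each estimator's 10% subset without replacement;
# n_jobs=-1 fits the 10 SVCs in parallel on all available cores.
clf = BaggingClassifier(
    SVC(kernel='linear', probability=True, class_weight='balanced'),
    n_estimators=10, max_samples=0.1, bootstrap=False, n_jobs=-1,
).fit(X, y)
proba = clf.predict_proba(X)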
