SVC classifier taking too much time for training


Problem description


I am using an SVC classifier with a linear kernel to train my model. Training data: 42,000 records.

    from sklearn.svm import SVC

    model = SVC(probability=True)
    model.fit(self.features_train, self.labels_train)
    y_pred = model.predict(self.features_test)
    train_accuracy = model.score(self.features_train, self.labels_train)
    test_accuracy = model.score(self.features_test, self.labels_test)

It takes more than 2 hours to train my model. Am I doing something wrong? Also, what can be done to reduce the training time?

Thanks in advance

Solution

There are several possibilities to speed up your SVM training. Let n be the number of records, and d the embedding dimensionality. I assume you use scikit-learn.

  • Reducing training set size. Quoting the docs:

    The fit time complexity is more than quadratic with the number of samples which makes it hard to scale to dataset with more than a couple of 10000 samples.

    O(n^2) complexity will most likely dominate other factors. Sampling fewer records for training will thus have the largest impact on time. Besides random sampling, you could also try instance selection methods. For example, principal sample analysis has been proposed recently.

  • Reducing dimensionality. As others have hinted at in their comments, embedding dimension also impacts runtime. Computing inner products for the linear kernel is in O(d). Dimensionality reduction can, therefore, also reduce runtime. In another question, latent semantic indexing was suggested specifically for TF-IDF representations.

  • Parameters. Use SVC(probability=False) unless you need the probabilities, because they "will slow down that method." (from the docs).
  • Implementation. To the best of my knowledge, scikit-learn just wraps around LIBSVM and LIBLINEAR. I am speculating here, but you may be able to speed this up by using an efficient BLAS library, such as Intel's MKL.
  • Different classifier. You may try sklearn.svm.LinearSVC, which is...

    [s]imilar to SVC with parameter kernel=’linear’, but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples.

    Moreover, a scikit-learn dev suggested the kernel_approximation module in a similar question.
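    A minimal sketch combining the last two suggestions (LinearSVC as a drop-in for the linear-kernel SVC, and the kernel_approximation module for non-linear kernels). The synthetic dataset and the Nystroem settings here are illustrative assumptions, not the asker's actual data:

```python
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import Nystroem
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Synthetic stand-in for the 42,000-record training set (assumption).
X, y = make_classification(n_samples=42000, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Option 1: LinearSVC -- liblinear-based, scales much better with n than
# SVC(kernel='linear') and needs no probability estimates.
linear_model = LinearSVC()
linear_model.fit(X_train, y_train)
print("LinearSVC accuracy:", linear_model.score(X_test, y_test))

# Option 2: approximate a non-linear (RBF) kernel with a Nystroem feature
# map, then train a fast linear classifier on the transformed features.
approx_model = make_pipeline(
    Nystroem(kernel="rbf", n_components=300, random_state=0),
    LinearSVC(),
)
approx_model.fit(X_train, y_train)
print("Nystroem + LinearSVC accuracy:", approx_model.score(X_test, y_test))
```

    Both options avoid the O(n^2) fit of the kernelized SVC; the trade-off is that the Nystroem map is only an approximation of the exact kernel, controlled by n_components.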
