How to optimize SciKit one-class training time?


Question

Essentially, my question is the same as "SciKit One-class SVM classifier training time increases exponentially with size of training data", but no one has figured out the problem there.

It runs fine with tens of thousands of samples, but hundreds of thousands take a very long time. I want to run it on tens of millions, and I don't want to wait a day and a half (maybe even more) for nothing to come of it. Is there a faster way to do this, or should I use something else?

Recommended answer

I'm very junior in this field, so take this with a grain of salt.

Isolation Forests appear to be an efficient solution for outlier detection. They have been shown to perform well against other popular algorithms [Liu, 2008]. Also, according to scikit-learn, One-class SVMs are somewhat susceptible to anomalies. The anomalies in your Class 1 could overlap with Class 2 and cause data to be mislabeled. Perhaps taking subsets of your samples and using them to create an ensemble of SVMs could avoid this (and still save you time, depending on the size of the subsets), but Isolation Forests do this naturally.
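To make the two options above concrete, here is a minimal sketch rather than a tuned solution. It assumes scikit-learn's `IsolationForest` and `OneClassSVM`, and it uses synthetic Gaussian data in place of your real features. The helper names (`fit_isolation_forest`, `fit_svm_ensemble`, `ensemble_predict`) and every parameter value (`max_samples=256`, `subset_size=2000`, `nu=0.05`, and so on) are hypothetical choices for illustration, not recommendations.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM


def fit_isolation_forest(X, seed=0):
    # Each tree is built on a random subsample of max_samples rows,
    # so tree construction does not depend on the full size of X.
    model = IsolationForest(n_estimators=100, max_samples=256, random_state=seed)
    return model.fit(X)


def fit_svm_ensemble(X, n_models=5, subset_size=2000, seed=0):
    # Hypothetical helper: train several One-class SVMs, each on a small
    # random subset, instead of one SVM on all of X.
    rng = np.random.default_rng(seed)
    size = min(subset_size, len(X))
    models = []
    for _ in range(n_models):
        idx = rng.choice(len(X), size=size, replace=False)
        models.append(OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X[idx]))
    return models


def ensemble_predict(models, X):
    # Average the signed decision scores; a negative mean is treated as an outlier.
    scores = np.mean([m.decision_function(X) for m in models], axis=0)
    return np.where(scores >= 0, 1, -1)


# Synthetic stand-in data (an assumption): Gaussian inliers, far-away outliers.
rng = np.random.default_rng(42)
X_train = rng.normal(size=(20000, 5))
X_outliers = rng.uniform(low=8.0, high=10.0, size=(10, 5))

forest = fit_isolation_forest(X_train)
forest_labels = forest.predict(X_outliers)  # scikit-learn convention: -1 outlier, 1 inlier

svms = fit_svm_ensemble(X_train)
svm_labels = ensemble_predict(svms, X_outliers)
```

Both approaches only ever fit on small subsamples, which is where the time saving would come from. Whether the accuracy holds up on your data is something you would have to check yourself.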

For further reading, this seems like a good reference paper on the topic: http://www.robots.ox.ac.uk/~davidc/pubs/NDreview2014.pdf

It mentions clustering and distance methods, which may be applicable in your case. I think it's best to do a lot of reading and make sure you understand the different strengths and weaknesses of the algorithms, especially since I'm still in the process of doing that myself and really couldn't give solid advice even if I knew the specifics of your problem.

A note on distance-based algorithms: I know some are optimized, but I think the general complaint is that they have high computational complexity. Many clustering-, distance-, and probability-based algorithms also have weaknesses when dealing with high-dimensional data.
