sklearn increasing number of jobs leads to slow training


Problem Description

I've been trying to get sklearn to use more CPU cores during grid search (doing this on a Windows machine). The code is this:

import numpy
from sklearn.ensemble import RandomForestClassifier
from sklearn import grid_search  # moved to sklearn.model_selection in sklearn >= 0.18

parameters = {'n_estimators': numpy.arange(1, 10), 'max_depth': numpy.arange(1, 10)}

estimator = RandomForestClassifier(verbose=1)

# features_train / labels_train are the asker's training data (not shown)
clf = grid_search.GridSearchCV(estimator, parameters, n_jobs=-1)
clf.fit(features_train, labels_train)

I'm testing this on a small dataset of only 100 samples.

When n_jobs is set to 1 (the default), everything proceeds as normal and finishes quickly. However, it only uses one CPU core.

In the above, I set n_jobs to -1 to use all CPU cores. When I do that (or use any value > 1), I can see that the correct number of cores is being utilized on my machine, but the speed is extremely slow. With n_jobs = 1, training finishes in about 10 seconds; with anything > 1, it can take 5-10 minutes.

What is the correct way to increase the number of cores used by grid search?

Recommended Answer

My suspicion is that this is related to the fact that you're only testing with a small dataset of 100 samples: it may simply not be big enough to justify the overhead of parallelization.
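
(A side note added here, not part of the original answer.) On Windows, joblib spawns fresh Python worker processes and pickles the training data over to each of them, so a search over 81 parameter combinations pays that fixed startup cost many times while each individual fit is nearly free. Spawning also means the parallel call has to live under a main guard. A minimal sketch of both points, with make_classification standing in for the asker's data and GridSearchCV imported from its modern sklearn.model_selection location:

import numpy
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

parameters = {'n_estimators': numpy.arange(1, 10), 'max_depth': numpy.arange(1, 10)}

# On Windows there is no fork(), so every worker process is spawned from
# scratch and re-imports this module; without the guard below, each worker
# would try to launch the grid search all over again.
if __name__ == '__main__':
    X, y = make_classification(n_samples=100, n_features=20, random_state=0)
    clf = GridSearchCV(RandomForestClassifier(verbose=1), parameters, n_jobs=-1)
    clf.fit(X, y)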

For a significantly larger dataset, the parallel mode should outperform the n_jobs = 1 approach. Have you tried testing this against a much larger sample?
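
One way to check is to time both settings side by side while scaling the data up a couple of orders of magnitude. A rough sketch (the sample sizes and the time_search helper are illustrative choices, not from the original exchange):

import time
import numpy
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

def time_search(n_samples, n_jobs):
    # Synthetic stand-in for the asker's features_train / labels_train.
    X, y = make_classification(n_samples=n_samples, n_features=20, random_state=0)
    parameters = {'n_estimators': numpy.arange(1, 10), 'max_depth': numpy.arange(1, 10)}
    clf = GridSearchCV(RandomForestClassifier(random_state=0), parameters, n_jobs=n_jobs)
    start = time.time()
    clf.fit(X, y)
    return time.time() - start

if __name__ == '__main__':
    for n_samples in (100, 20000):
        for n_jobs in (1, -1):
            print('n_samples=%5d  n_jobs=%2d  ->  %.1fs'
                  % (n_samples, n_jobs, time_search(n_samples, n_jobs)))

On the tiny set, n_jobs=-1 should lose to n_jobs=1; as n_samples grows, the ranking should flip.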
