How to find an optimum number of processes in GridSearchCV( ..., n_jobs = ... )?

Question

I'm wondering which is better to use with GridSearchCV( ..., n_jobs = ... ) to pick the best parameter set for a model: n_jobs = -1, or n_jobs with a big number, like n_jobs = 30?

Based on the scikit-learn documentation:

n_jobs = -1 means that the computation will be dispatched on all the CPUs of the computer.

On my PC I have an Intel i3 CPU, which has 2 cores and 4 threads. Does that mean that if I set n_jobs = -1, it will implicitly be equal to n_jobs = 2?

Answer

... does that mean if I set n_jobs = -1, implicitly it will be equal to n_jobs = 2 ?

This one is easy:

Python ( scikit-learn, via joblib, inside GridSearchCV() ) detects the number of CPU-cores on which it is reasonable to schedule concurrent ( independent ) processes whenever a request is made with an n_jobs = -1 setting. That count is normally the number of logical CPUs the operating system reports, so on a 2-core / 4-thread i3 it would typically be 4, not 2.
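As a rough illustration of how a negative n_jobs gets resolved, here is a minimal sketch. Note that resolve_n_jobs is a hypothetical helper, not a scikit-learn or joblib function; it only mimics the rule described in the joblib documentation ( n_cpus + 1 + n_jobs workers for negative values ) and assumes that os.cpu_count() reflects what the library would detect on your host.

```python
import os


def resolve_n_jobs(n_jobs, n_cpus=None):
    """Hypothetical helper: mimic the documented joblib rule for n_jobs."""
    if n_cpus is None:
        # Logical CPUs as reported by the OS (hyper-threads included).
        n_cpus = os.cpu_count() or 1
    if n_jobs == 0:
        raise ValueError("n_jobs == 0 has no meaning")
    if n_jobs < 0:
        # -1 -> all CPUs, -2 -> all but one, and so on; never below 1.
        return max(n_cpus + 1 + n_jobs, 1)
    # Positive values are taken literally, even if they exceed the CPU count.
    return n_jobs


if __name__ == "__main__":
    print("logical CPUs reported by the OS:", os.cpu_count())
    for requested in (-1, -2, 2, 30):
        # n_cpus = 4 stands in for a 2-core / 4-thread i3 (an assumption).
        print(requested, "->", resolve_n_jobs(requested, n_cpus=4))
```

Under this rule n_jobs = 30 is not clamped to the hardware: 30 workers would be requested on a 4-CPU machine, which is where the overhead discussion further down becomes relevant.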

Funny to see a 3-CPU-core machine?

In some virtualised-machine cases, which can synthetically emulate CPUs / cores, the results are not as trivial as in your known Intel CPU / i3 case.

If in doubt, one can test this with a trivialised case ( on a really small data-set, not the full-blown model-space search ... ) and let the results prove it.

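import psutil  # third-party package, not in the standard library

# Physical cores only ( hyper-threads are not counted )
print( "{0:17s}{1:} CPUs PHYSICAL".format( "psutil:", psutil.cpu_count( logical = False ) ) )

# Logical CPUs ( hyper-threads are counted )
print( "{0:17s}{1:} CPUs LOGICAL".format(  "psutil:", psutil.cpu_count( logical = True  ) ) )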
...

A similar host-platform "self-detection" may report more details for different systems / settings:

'''
sys:             linux 
                 3.6.1 (default, Jun 27 2017, 14:35:15)  .. [GCC 7.1.1 20170622 (Red Hat 7.1.1-3)]

multiprocessing: 1 CPU(s)
psutil:          1 CPUs PHYSICAL
psutil:          1 CPUs LOGICAL
psutil:          psutil.cpu_freq(  per_cpu = True  ) not able to report. ?( v5.1.0+ )
psutil:          5.0.1
psutil:          psutil.cpu_times( per_cpu = True  ) not able to report. ?( vX.Y.Z+ )
psutil:          5.0.1
psutil:          svmem(total=1039192064, available=257290240, percent=75.2, used=641396736, free=190361600, active=581107712, inactive=140537856, buffers=12210176, cached=195223552, shared=32768)
numexpr:         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ModuleNotFoundError: No module named 'numexpr'.
joblib:          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ModuleNotFoundError: No module named 'joblib'.
sklearn/joblib:  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ModuleNotFoundError: No module named 'sklearn.externals.joblib' 
'''



Or

''' [i5]
>>> numexpr.print_versions()
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Numexpr version:   2.5
NumPy version:     1.10.4
Python version:    2.7.13 |Anaconda 4.0.0 (32-bit)| (default, May 11 2017, 14:07:41) [MSC v.1500 32 bit (Intel)]
AMD/Intel CPU?     True
VML available?     True
VML/MKL version:   Intel(R) Math Kernel Library Version 11.3.1 Product Build 20151021 for 32-bit applications
Number of threads used by default: 4 (out of 4 detected cores)
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
'''


... which is better to use with GridSearchCV to pick the best parameter set for a model,
n_jobs = -1 or n_jobs with a big number like n_jobs = 30 ?

There is no easy "One-Size-Fits-All" answer to this:

The Scikit tools ( and many others that follow this practice ) spawn, when the n_jobs directive is used, the required number of concurrent process-instances ( so as to escape from shared GIL-lock stepping - read more on this elsewhere if interested in details ).

This process-instantiation is not cost-free ( both time-wise, i.e. spending a respectable amount of [TIME]-domain costs, and space-wise, i.e. spending at least n_jobs-times the RAM-allocation of a single python process-instance in the [SPACE]-domain ).

Given this, your fight is a battle against a double-edged sword.

An attempt to "underbook" the CPU will possibly leave ( some ) CPU-cores idling.
An attempt to "overbook" the RAM-space will make your performance worse than expected, as virtual memory will force the operating system to start swapping, which turns your Machine Learning-scaled data-access times from ~ 10+ [ns] into ~ 10+ [ms], more than 100,000 x slower, which is hardly something one will be pleased with.
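The two constraints above can be combined into a simple back-of-the-envelope cap. This is only a sketch: cap_n_jobs is a hypothetical helper, the per-process RAM footprint has to be measured on your own workload, and the figures used below are made-up illustrative numbers, not measurements.

```python
def cap_n_jobs(n_cpus, available_ram_bytes, ram_per_process_bytes):
    """Hypothetical helper: largest n_jobs that overbooks neither CPU nor RAM."""
    if ram_per_process_bytes <= 0:
        raise ValueError("ram_per_process_bytes must be positive")
    # How many full process-instances fit into RAM without swapping.
    fits_in_ram = available_ram_bytes // ram_per_process_bytes
    # Never go above the CPU count, never go below one worker.
    return max(1, min(n_cpus, fits_in_ram))


GiB = 1024 ** 3

if __name__ == "__main__":
    # Assumed figures: 4 logical CPUs, 8 GiB free, 3 GiB per worker process.
    print(cap_n_jobs(n_cpus=4, available_ram_bytes=8 * GiB,
                     ram_per_process_bytes=3 * GiB))  # -> 2
```

In this made-up case RAM, not the CPU count, is the binding limit: two workers fit, a third would push the machine into swapping. The available-RAM figure could be taken from something like the svmem report shown earlier.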

The overall effect of n_jobs = a_reasonable_amount_of_processes is subject to Amdahl's Law ( the re-formulated one that accounts for add-on overheads, not the overhead-naive version ), so there will be a practical optimality peak ( a maximum ) in how many CPU-cores help to improve one's processing, beyond which the overhead costs ( sketched for both the [TIME]- and [SPACE]-domains above ) will actually erode any expected positive impact.
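To make the "optimality peak" idea concrete, here is a toy model. It is an assumption made for illustration, not the exact overhead-aware formulation referred to above: the classic Amdahl runtime gets one extra term that grows linearly with the number of processes, standing in for spawn and memory-transfer costs. The names speedup and best_n_jobs, the parallel fraction 0.95 and the overhead 0.005 per process are all made up.

```python
def speedup(n_jobs, parallel_fraction, overhead_per_process):
    """Toy overhead-aware Amdahl speedup, relative to a serial runtime of 1.0."""
    serial_part = 1.0 - parallel_fraction
    parallel_part = parallel_fraction / n_jobs
    # Assumed overhead model: every extra process adds a fixed cost.
    overhead = overhead_per_process * n_jobs
    return 1.0 / (serial_part + parallel_part + overhead)


def best_n_jobs(max_jobs, parallel_fraction, overhead_per_process):
    """Return the n_jobs in 1..max_jobs with the highest modelled speedup."""
    return max(range(1, max_jobs + 1),
               key=lambda n: speedup(n, parallel_fraction, overhead_per_process))


if __name__ == "__main__":
    p, o = 0.95, 0.005  # illustrative values only
    peak = best_n_jobs(32, p, o)
    print("modelled peak at n_jobs =", peak)
    print("speedup at the peak   :", round(speedup(peak, p, o), 2))
    print("speedup at n_jobs = 30:", round(speedup(30, p, o), 2))
```

With these made-up parameters the modelled speedup peaks well below 30 processes and then declines, which is the qualitative behaviour described above. Real values of the parallel fraction and the overhead can only come from timing your own search on your own hardware.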

Having used RandomForestRegressor() on really large data-sets in production, I can tell you the [SPACE]-domain is the worst of your enemies in trying to grow n_jobs any further, and no system-level tuning will ever overcome this boundary ( so more and more ultra-low-latency RAM and more and more ( real ) CPU-cores are the only practical recipe for going into any larger n_jobs computing plans ).
