How to find an optimum number of processes in GridSearchCV( ..., n_jobs = ... )?


Question


I'm wondering, which is better to use with GridSearchCV( ..., n_jobs = ... ) to pick the best parameter set for a model, n_jobs = -1 or n_jobs with a big number,
like n_jobs = 30 ?

Based on Sklearn documentation:

n_jobs = -1 means that the computation will be dispatched on all the CPUs of the computer.

On my PC I have an Intel i3 CPU, which has 2 cores and 4 threads, so does that mean if I set n_jobs = -1, implicitly it will be equal to n_jobs = 2 ?

Solution

... does that mean if I set n_jobs = -1, implicitly it will be equal to n_jobs = 2 ?

This one is easy :

python ( scipy / joblib inside a GridSearchCV() ) is used to detect the number of CPU-cores on which it is reasonable to schedule concurrent ( independent ) processes, given the request was made with an n_jobs = -1 setting.

Funny to see a 3-CPU-core count?

In some virtualised-machine cases, which can synthetically emulate CPUs / cores, the results are not as trivial as in your known Intel CPU / i3 case.

If in doubt, one can test this with a trivialised case ( on an indeed small data-set, not the full-blown model-space search ... ) and let the story go on to prove this.

# probe how many CPU-cores the host actually exposes to python
import psutil

print( "{0:17s}{1:} CPUs PHYSICAL".format( "psutil:",
                                            psutil.cpu_count( logical = False ) ) )
print( "{0:17s}{1:} CPUs LOGICAL".format(  "psutil:",
                                            psutil.cpu_count( logical = True  ) ) )
...
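For a quick cross-check of what the standard library and joblib themselves report, a minimal sketch ( assuming joblib is importable; these generic probes need not match exactly what a given joblib backend will finally resolve n_jobs = -1 into ):

# cross-check the core counts reported by os, multiprocessing and joblib
import os
import multiprocessing
import joblib                                      # assumed installed; sklearn relies on it for n_jobs dispatch

print( "{0:17s}{1:} CPUs".format( "os:",              os.cpu_count() ) )
print( "{0:17s}{1:} CPUs".format( "multiprocessing:", multiprocessing.cpu_count() ) )
print( "{0:17s}{1:} CPUs".format( "joblib:",          joblib.cpu_count() ) )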

A similar host-platform "self-detection" may report more details for different systems / settings:

'''
sys:             linux 
                 3.6.1 (default, Jun 27 2017, 14:35:15)  .. [GCC 7.1.1 20170622 (Red Hat 7.1.1-3)]

multiprocessing: 1 CPU(s)
psutil:          1 CPUs PHYSICAL
psutil:          1 CPUs LOGICAL
psutil:          psutil.cpu_freq(  per_cpu = True  ) not able to report. ?( v5.1.0+ )
psutil:          5.0.1
psutil:          psutil.cpu_times( per_cpu = True  ) not able to report. ?( vX.Y.Z+ )
psutil:          5.0.1
psutil:          svmem(total=1039192064, available=257290240, percent=75.2, used=641396736, free=190361600, active=581107712, inactive=140537856, buffers=12210176, cached=195223552, shared=32768)
numexpr:         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ModuleNotFoundError: No module named 'numexpr'.
joblib:          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ModuleNotFoundError: No module named 'joblib'.
sklearn/joblib:  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ModuleNotFoundError: No module named 'sklearn.externals.joblib' 
'''


Or

''' [i5]
>>> numexpr.print_versions()
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Numexpr version:   2.5
NumPy version:     1.10.4
Python version:    2.7.13 |Anaconda 4.0.0 (32-bit)| (default, May 11 2017, 14:07:41) [MSC v.1500 32 bit (Intel)]
AMD/Intel CPU?     True
VML available?     True
VML/MKL version:   Intel(R) Math Kernel Library Version 11.3.1 Product Build 20151021 for 32-bit applications
Number of threads used by default: 4 (out of 4 detected cores)
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
'''


... which is better to use with GridSearchCV to pick the best parameter set for a model,
n_jobs = -1 or n_jobs with a big number like n_jobs = 30 ?

There is no easy "One-Size-Fits-All" answer to this :

The Scikit tools ( and many others that followed this practice ) spawn, whenever the n_jobs directive is used, the required amount of concurrent process-instances ( so as to escape from the shared GIL-lock stepping - read more on this elsewhere if interested in the details ).

This process-instantiation is not cost-free ( both time-wise, i.e. spending a respectable amount of [TIME]-domain costs, and also space-wise, i.e. spending at least n_jobs-times the RAM-allocation of a single python process-instance in the [SPACE]-domain ).
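A hedged, back-of-the-envelope sketch of that [SPACE]-domain bill ( illustrative only - real child-process footprints depend on the operating system, the joblib backend and possible copy-on-write sharing, so the worst-case product below is an upper-bound estimate, not a measurement ):

# estimate a worst-case RAM bill for n_jobs worker processes
import os
import psutil

n_jobs = 4                                         # an intended number of worker processes ( illustrative )
rss    = psutil.Process( os.getpid() ).memory_info().rss

print( "single python process RSS   ~ {0:>15,d} [B]".format( rss ) )
print( "worst case for n_jobs = {0}:  ~ {1:>15,d} [B]".format( n_jobs, n_jobs * rss ) )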

Given this, your fight is a battle against a double-edged sword.

An attempt to "underbook" CPUs will leave ( some ) CPU-cores possibly idling.
An attempt to "overbook" RAM-space will make your performance worse than expected, as virtual memory will push the operating system into swapping, which turns your Machine-Learning-scaled data-access times from ~ 10+ [ns] into something more than 100,000 x slower, ~ 10+ [ms], which is hardly what one will be pleased with.

The overall effect of n_jobs = a_reasonable_amount_of_processes is subject to Amdahl's Law ( the re-formulated one, not the add-on-overhead-naive version ), so there will be a practical optimality peak ( a maximum ) in how many CPU-cores will help to improve one's processing intentions, beyond which the overhead costs ( sketched for both the [TIME]- and [SPACE]-domains above ) will actually deteriorate any potential positive impact expectations.
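A hedged way to locate that peak in practice is a down-scaled dry-run: sweep a few n_jobs values over a small data-set and a small parameter grid and time each pass ( a sketch only - the data-set, the estimator and the grid below are illustrative placeholders, and an optimum found on a toy problem need not carry over 1:1 to the full-size search ):

# time a small GridSearchCV() for several n_jobs settings to see where the speed-up flattens
from time import monotonic
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification( n_samples = 2000, n_features = 20, random_state = 0 )
grid = { "n_estimators": [ 50, 100 ], "max_depth": [ 4, 8, None ] }

for n_jobs in ( 1, 2, 4, -1 ):                     # candidate degrees of parallelism
    t0 = monotonic()
    GridSearchCV( RandomForestClassifier( random_state = 0 ),
                  grid,
                  cv     = 3,
                  n_jobs = n_jobs
                  ).fit( X, y )
    print( "n_jobs = {0:>2}: {1:6.2f} [s]".format( n_jobs, monotonic() - t0 ) )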

Having used RandomForestRegressor() on indeed large data-sets in production, I can tell you the [SPACE]-domain is the worst of your enemies in trying to grow n_jobs any farther, and no system-level tuning will ever overcome this boundary ( so more and more ultra-low-latency RAM and more and more ( real ) CPU-cores is the only practical recipe for going into any indeed larger n_jobs computing plans ).
