Python multiprocessing performance only improves with the square root of the number of cores used


Problem description


I am attempting to implement multiprocessing in Python (Windows Server 2012) and am having trouble achieving the degree of performance improvement that I expect. In particular, for a set of tasks which are almost entirely independent, I would expect a linear improvement with additional cores.


I understand that--especially on Windows--there is overhead involved in opening new processes [1], and that many quirks of the underlying code can get in the way of a clean trend. But in theory the trend should ultimately still be close to linear for a fully parallelized task [2]; or perhaps logistic if I were dealing with a partially serial task [3].

However, when I run multiprocessing.Pool on a prime-checking test function (code below), I get a nearly perfect square-root relationship up to N_cores=36 (the number of physical cores on my server) before the expected performance hit when I get into the additional logical cores.


Here is a plot of my performance test results :
( "Normalized Performance" is [ a run time with 1 CPU-core ] divided by [ a run time with N CPU-cores ] ).

Is it normal to have this dramatic diminishing of returns with multiprocessing? Or am I missing something with my implementation?


import numpy as np
from multiprocessing import Pool, cpu_count, Manager
import math as m
from functools import partial
from time import time

def check_prime(num):

    #Assert positive integer value
    if num!=m.floor(num) or num<1:
        print("Input must be a positive integer")
        return None

    #Check divisibility for all possible factors (note: 1 is not prime)
    if num < 2:
        return False
    prime = True
    for i in range(2,num):
        if num%i==0: prime=False
    return prime

def cp_worker(num, L):
    prime = check_prime(num)
    L.append((num, prime))


def mp_primes(omag, mp=cpu_count()):
    with Manager() as manager:
        np.random.seed(0)
        numlist = np.random.randint(10**omag, 10**(omag+1), 100)

        L = manager.list()
        cp_worker_ptl = partial(cp_worker, L=L)

        try:
            pool = Pool(processes=mp)   
            list(pool.imap(cp_worker_ptl, numlist))
        except Exception as e:
            print(e)
        finally:
            pool.close() # no more tasks
            pool.join()

        return L


if __name__ == '__main__':
    rt = []
    for i in range(cpu_count()):
        t0 = time()
        mp_result = mp_primes(6, mp=i+1)
        t1 = time()
        rt.append(t1-t0)
        print("Using %i core(s), run time is %.2fs" % (i+1, rt[-1]))

Note: I am aware that for this task it would likely be more efficient to implement multithreading, but the actual script for which this one is a simplified analog is incompatible with Python multithreading due to the GIL.

Solution

@KellanM deserves [+1] for quantitative performance monitoring

am I missing something with my implementation?

Yes, your expectation abstracts away all the add-on costs of process management.
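One concrete add-on cost worth ruling out first (an assumption on my part, not something the answer states): in the question's code every result is appended to a `Manager().list()`, and each such append is an IPC round-trip to a separate manager process. A minimal sketch that returns results through `Pool.map` instead, with no per-item manager round-trip:

```python
from multiprocessing import Pool

def check_prime(num):
    # Same brute-force trial division as in the question ( O(num) on purpose )
    if num < 2:
        return False
    for i in range(2, num):
        if num % i == 0:
            return False
    return True

def cp_worker(num):
    # Return the result instead of appending to a Manager().list() proxy;
    # Pool.map collects the return values for us over the pool's own pipes
    return (num, check_prime(num))

if __name__ == "__main__":
    nums = [97, 100, 101]  # hypothetical sample inputs
    with Pool(processes=2) as pool:
        results = pool.map(cp_worker, nums)
    print(results)  # [(97, True), (100, False), (101, True)]
```

Whether this changes the measured scaling curve is an experiment, not a given, but it removes one serialization point (the single manager process) from the picture.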

While you have expressed an expectation of " a linear improvement with additional cores. ", this hardly ever appears in practice, for several reasons ( nothing comes for free ).

Gene AMDAHL formulated the initial law of diminishing returns.
A more recent, re-formulated version also takes into account the effects of process-management { setup | terminate } add-on overhead costs, and tries to cope with the atomicity of processing ( large work-package payloads cannot easily be re-located / re-distributed across the available pool of free CPU-cores in most common programming systems, except by some indeed specific micro-scheduling art, like that demonstrated in Semantic Designs' PARLANSE or LLNL's SISAL, shown so colourfully in the past ).
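The overhead-aware reformulation can be sketched numerically. Assuming a hypothetical parallel fraction `p` and a per-process management overhead `o` (both expressed as fractions of the serial runtime; these are illustrative values, not measurements), the speedup `S(n) = 1 / ((1-p) + p/n + o*n)` peaks at a finite core count and then declines:

```python
def speedup(n, p=0.95, o=0.01):
    """Overhead-aware Amdahl speedup: serial part + parallel part / n
    plus a linear per-process management overhead o*n.
    p and o are assumed, illustrative values, not measured ones."""
    return 1.0 / ((1.0 - p) + p / n + o * n)

# Find the point of diminishing returns for this assumed model
best = max(range(1, 65), key=speedup)
for n in (1, 2, 4, 8, 16, 32, 64):
    print(f"n={n:2d}  S={speedup(n):5.2f}")
print("diminishing-returns point near n =", best)
```

Minimizing `(1-p) + p/n + o*n` over `n` puts the optimum at `n = sqrt(p/o)`; under this assumed model, performance near the optimum therefore grows roughly with the square root of the overhead-to-work ratio, which at least rhymes with the square-root shape the question observes.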


A best next step?

If you are indeed interested in this domain, you can always experimentally measure and compare the real costs of process management ( plus data-flow costs, plus memory-allocation costs, ... up to process termination and results re-assembly in the main process ), so as to quantitatively and fairly record and evaluate the add-on cost / benefit ratio of using more CPU-cores ( each of which, in python, re-instates the whole python-interpreter state, including all its memory state, before the first useful operation is carried out in the first spawned and set-up process ).
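Such a measurement can be sketched directly with the standard library, timing pool setup, the dispatch of no-op tasks, and teardown separately (the phase names and item counts here are my own choices for illustration):

```python
from multiprocessing import Pool
from time import perf_counter

def noop(x):
    return x

def measure_pool_overhead(nproc=4, nitems=100):
    """Time pool setup, a map of no-op tasks, and teardown separately,
    to put concrete numbers on the process-management overhead."""
    t0 = perf_counter()
    pool = Pool(processes=nproc)      # process spawn/fork + setup cost
    t1 = perf_counter()
    pool.map(noop, range(nitems))     # dispatch + IPC cost, no real work
    t2 = perf_counter()
    pool.close()
    pool.join()                       # teardown cost
    t3 = perf_counter()
    return {"setup": t1 - t0, "dispatch": t2 - t1, "teardown": t3 - t2}

if __name__ == "__main__":
    for phase, secs in measure_pool_overhead().items():
        print(f"{phase:9s}: {secs:.4f}s")
```

Comparing these numbers against the runtime of one real work item tells you how large a work-package has to be before the management overhead stops dominating.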

Under-performance ( the former case below ), if not disastrous effects ( the latter case below ), of either kind of ill-engineered resources-mapping policy, be it
an "under-booking" of resources from a pool of CPU-cores,
or
an "over-booking" of resources from a pool of RAM-space,
is discussed also here.

The link to the re-formulated Amdahl's Law above will help you evaluate the point of diminishing returns, so as not to pay more than you will ever receive.

The experiments of Hoefinger and Haunschmid may serve as good practical evidence of how a growing number of processing nodes ( be it a local O/S-managed CPU-core or a NUMA distributed-architecture node ) will start to decrease the resulting performance,
where the point of diminishing returns ( demonstrated in the overhead-agnostic Amdahl's Law )
will actually become the point after which you pay more than you receive:

Good luck in this interesting field!


Last, but not least,

NUMA / non-locality issues get their voice heard in discussions of scaling for HPC-grade tuning ( in-Cache / in-RAM computing strategies ) and may, as a side-effect, help detect flaws ( as reported by @eryksun above ). Feel free to review your platform's actual NUMA topology with the lstopo tool, to see the abstraction your operating system is trying to work with once it schedules "just"-[CONCURRENT] task execution over such a NUMA-resources-topology:
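Before trusting `cpu_count()`, it is also worth checking how many cores the scheduler actually grants your process, since affinity masks or cgroups can shrink the usable set below the hardware total (a Linux-only sketch; `os.sched_getaffinity` is not available on Windows, where lstopo from the hwloc package gives the fuller picture):

```python
import os

# Cores the O/S scheduler actually lets this process run on; on a NUMA box
# or inside a container this can be fewer than the hardware total.
usable = os.sched_getaffinity(0)   # Linux-only API (assumption: Linux host)
print(f"{len(usable)} usable cores out of {os.cpu_count()} reported")
```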
