How can I use more CPU to run my python script


Problem description

I want to use more processors to run my code, purely to minimize the running time. I have tried to do this but failed to get the desired result. My actual code is very big, which is why I am giving a very small and simple example here (even though it does not need a parallel job) just to learn how to do parallel jobs in Python. Any comments/suggestions will be highly appreciated.

import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import odeint


def solveit(n,y0):
    def exam(y, x):
        theta, omega = y
        dydx = [omega, - (2.0/x)*omega - theta**n]
        return dydx

    x = np.linspace(0.1, 10, 100)

    #call integrator
    sol = odeint(exam, y0, x)

    plt.plot(x, sol[:, 0], label='For n = %s,y0=(%s,%s)'%(n,y0[0],y0[1]))


ys= [[1.0, 0.0],[1.2, 0.2],[1.3, 0.3]]

fig = plt.figure()
for y_ in ys:
    solveit(1.,y_)

plt.legend(loc='best')
plt.grid()
plt.show() 

Answer

Q: How can I use more CPU to run my python script?

A few remarks first, on "The Factors of The Game" how any more CPU might at all get counted into the flow of the processing-tasks execution:
( detailed examples follow )

  • The Costs of going to achieve a reasonable speedup from a re-organise'd process-flow from an as-is state into a feasible parallel-code execution fashion
  • Known python Limits for executing any parallel computing-intensive strategy to know about
  • python script itself, i.e. The Code will look way different, most of all if attempting to harness MPI-distributed memory parallelism, operated "across" a set of {cluster|grid}-connected machines

[PARALLEL] process flow is the most complicated form of process-flow organisation: parallelised processes must start, execute and also complete at the same time, typically within a time-constraint, so any indeterministic blocking or other source of uncertainty ought be avoided (not "just" mitigated on-the-fly, avoided, principally prevented - and that is hard)

[CONCURRENT] process flow is way easier to achieve, given there are more free resources, the concurrency-policy based process-scheduler can direct some work-streams ( threads ) to start being executed on such a free resource ( disk-I/O, CPU-execution, etc ) and also can "enforce" such work being soft-signalled or force-fully interrupted after some scheduler's side decided amount of time and temporarily evicted from using a "just-for-a-moment-lended" device/resource, so as another work-stream ( thread ) candidate's turn has come, after indeterministically long or priority-driven waiting in the scheduler's concurrent-scheduling policy queue took place.

[SERIAL] process flow is the simplest form - one step after another after another without any stress from real-time passing along - "mañana (maˈɲana; English məˈnjɑːnə) n, adv .. b. some other and later time"

The Python interpreter has since ever been damned-[SERIAL], even though syntax-constructors have brought tools for both { lightweight-THREAD-based | heavyweight-full-copy-PROCESS }-based forms of "concurrent"-code-invocations

The lightweight form is known to still rely on the python GIL-lock, which makes the actual execution re-[SERIAL]-ised again, by temporarily lending the central interpreter's GIL-lock in a round-robin fashion, driven by a constant amount of time, to whatever big herd-of-THREADs. The result is finally [SERIAL] again and this can be useful for "external"-latency-masking (example), but never for HPC-grade computing...
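
As an illustrative sketch of that re-[SERIAL]-isation (not part of the original answer; the helper name cpu_bound and the task counts are made up), timing a purely CPU-bound function serially and under a ThreadPoolExecutor typically shows no thread-based speedup on a GIL-bound CPython:

import time
from concurrent.futures import ThreadPoolExecutor

def cpu_bound(n=200_000):            # purely CPU-bound work, no I/O to mask
    s = 0
    for i in range(n):
        s += i * i
    return s

N_TASKS = 8

t0 = time.perf_counter()
serial = [cpu_bound() for _ in range(N_TASKS)]
t1 = time.perf_counter()

with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(lambda _: cpu_bound(), range(N_TASKS)))
t2 = time.perf_counter()

print(f"serial  : {t1 - t0:.3f} [s]")
print(f"threads : {t2 - t1:.3f} [s]  <- GIL re-[SERIAL]-ises CPU-bound work")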

Even the GIL-escaping attempts to pay all costs and harness the heavyweight form of full-copy-PROCESS-based [CONCURRENT]-code execution are not free from headaches - just read carefully the warnings about crashes and about the few, very rare resources left hanging after leaks until the next platform reboot(!):

Changed in version 3.8: On macOS, the spawn start method is now the default. The fork start method should be considered unsafe as it can lead to crashes of the subprocess. See bpo-33725.

Changed in version 3.4: spawn added on all unix platforms, and forkserver added for some unix platforms. Child processes no longer inherit all of the parents inheritable handles on Windows.

On Unix using the spawn or forkserver start methods will also start a resource tracker process which tracks the unlinked named system resources (such as named semaphores or SharedMemory objects) created by processes of the program. When all processes have exited the resource tracker unlinks any remaining tracked object. Usually there should be none, but if a process was killed by a signal there may be some "leaked" resources. (Neither leaked semaphores nor shared memory segments will be automatically unlinked until the next reboot. This is problematic for both objects because the system allows only a limited number of named semaphores, and shared memory segments occupy some space in the main memory.)
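
If one still wants to try the heavyweight, process-based route for the question's workload, a minimal sketch could look like the following (an assumed approach, not code from the original answer): each parameter set is solved in a worker process via multiprocessing.Pool, only plain data is returned, and all matplotlib plotting stays in the parent process:

import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import odeint
from multiprocessing import Pool

def solve_one(args):
    n, y0 = args                        # unpack one parameter set
    def exam(y, x):
        theta, omega = y
        return [omega, -(2.0 / x) * omega - theta**n]
    x = np.linspace(0.1, 10, 100)
    sol = odeint(exam, y0, x)
    return n, y0, x, sol                # data only, plotted by the parent

if __name__ == '__main__':              # required for the spawn start method
    ys = [[1.0, 0.0], [1.2, 0.2], [1.3, 0.3]]
    with Pool(processes=3) as pool:
        results = pool.map(solve_one, [(1.0, y0) for y0 in ys])

    for n, y0, x, sol in results:
        plt.plot(x, sol[:, 0], label='For n = %s, y0=(%s,%s)' % (n, y0[0], y0[1]))
    plt.legend(loc='best')
    plt.grid()
    plt.show()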

Most of the time we will be happy with a good code-design, polished for python, augmented with some sort of smart vectorisation and [CONCURRENT] processing organisation.

True [PARALLEL] code execution is a thing most probably no one would ever try to implement inside the deterministically GIL-interrupted python [SERIAL]-code interpreter ( as of 2019-Q3, this Game seems to have obviously been lost a priori ).

The costs are always present:

Smaller for THREAD-based attempts, larger for PROCESS-based attempts, biggest for refactoring the code into distributed-memory parallelism ( using MPI inter-process communication mediating tools or other forms of going distributed ).
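
For the distributed-memory end of that spectrum, a minimal sketch, assuming the mpi4py package (the answer names no concrete library), might hand one parameter set to each MPI rank and gather the results on rank 0:

# launch with e.g.:  mpiexec -n 3 python this_script.py
from mpi4py import MPI
import numpy as np
from scipy.integrate import odeint

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

ys = [[1.0, 0.0], [1.2, 0.2], [1.3, 0.3]]       # one initial condition per rank
y0 = ys[rank % len(ys)]

def exam(y, x, n=1.0):
    theta, omega = y
    return [omega, -(2.0 / x) * omega - theta**n]

x = np.linspace(0.1, 10, 100)
sol = odeint(exam, y0, x)

gathered = comm.gather((y0, sol), root=0)       # collect results on rank 0
if rank == 0:
    print("collected", len(gathered), "solutions")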

Each syntax-trick has some add-on costs, i.e. how long it takes in [TIME] and how big the add-on memory-allocations in [SPACE] are, before the "internal part" ( the useful code ) starts to work for us ( and hopefully accelerates the overall run-time ). If this lump sum of add-on costs ( processing-setup costs + parameter-transfer costs + coordination-and-communication costs + collection-of-results costs + processing-termination costs ) is the same as, or worse, higher than the sought-for acceleration, you suddenly find yourself paying more than you receive.
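
To see that lump sum of add-on costs in isolation, one can time an almost-zero-work task serially and through a process Pool (a small illustration only; the helper name almost_no_work and the sizes are arbitrary):

import time
from multiprocessing import Pool

def almost_no_work(x):
    return x + 1                      # trivially small "useful" payload

if __name__ == '__main__':
    data = list(range(1000))

    t0 = time.perf_counter()
    serial = [almost_no_work(x) for x in data]
    t1 = time.perf_counter()

    with Pool(processes=4) as pool:   # pays spawn/fork + IPC + pickling costs
        parallel = pool.map(almost_no_work, data)
    t2 = time.perf_counter()

    print(f"serial  : {(t1 - t0) * 1e6:9.1f} [us]")
    print(f"Pool(4) : {(t2 - t1) * 1e6:9.1f} [us]  <- add-on costs dominate here")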

When not having the final working code for testing the hot-spots, one may inject something like the following crash-test-dummy code, so that the CPU and RAM get a stress-test workload:

import numpy   # needed by the workload below

##########################################################################
#-EXTERNAL-zmq.Stopwatch()'d-.start()-.stop()-clocked-EXECUTION-----------
#
def aFATpieceOfRAMallocationAndNUMPYcrunching( aRAM_size_to_allocate =  1E9,
                                               aCPU_load_to_generate = 20
                                               ):
    #-XTRN-processing-instantiation-COSTs
    #---------------------------------------------------------------------
    #-ZERO-call-params-transfer-COSTs
    #---------------------------------------------------------------------
    #-HERE---------------------------------RAM-size'd-STRESS-TEST-WORKLOAD
    _ = numpy.random.randint( -127,
                               127,
                               size  = int( aRAM_size_to_allocate ),
                               dtype = numpy.int8
                               )
    #---------------------------------------------------------------------
    #-HERE-----------------------------------CPU-work-STRESS-TEST-WORKLOAD
    # >>> aClk.start();_ = numpy.math.factorial( 2**f );aClk.stop()
    #              30 [us] for f =  8
    #             190 [us] for f = 10
    #           1 660 [us] for f = 12
    #          20 850 [us] for f = 14
    #         256 200 [us] for f = 16
    #       2 625 728 [us] for f = 18
    #      27 775 600 [us] for f = 20
    #     309 533 629 [us] for f = 22
    #  +3 ... ... ... [us] for f = 24+ & cluster-scheduler may kill job
    # +30 ... ... ... [us] for f = 26+ & cluster-manager may block you
    # ... ... ... ... [us] for f = 28+ & cluster-owner will hunt you!
    #
    return len( str( [ numpy.math.factorial( 2**f )
                                            for f in range( min( 22,
                                                                 aCPU_load_to_generate
                                                                 )
                                                            )
                       ][-1]
                     )
                ) #---- MAY TRY TO return( _.astype(  numpy.int64 )
                #------                  + len( str( [numpy.math.factorial(...)...] ) )
                #------                    )
                #------         TO TEST also the results-transfer COSTs *
                #------                      yet, be careful +RAM COSTs *
                #------                      get explode ~8+ times HERE *
#
#-EXTERNAL-ZERO-results-transfer-and-collection-COSTs
#########################################################################
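
One possible way (a sketch, not part of the original answer) to clock the crash-test-dummy above, serially and via a small process Pool, is to use time.perf_counter() in place of the zmq.Stopwatch mentioned in the banner; note that numpy.math.factorial was just an alias of the stdlib math module and may need to be replaced by math.factorial on recent numpy releases:

import time
from multiprocessing import Pool

if __name__ == '__main__':
    t0 = time.perf_counter()
    aFATpieceOfRAMallocationAndNUMPYcrunching( 1E6, 16 )   # small sizes first!
    t1 = time.perf_counter()
    print(f"serial call         : {(t1 - t0) * 1e6:12.1f} [us]")

    with Pool(processes=2) as pool:
        pool.starmap(aFATpieceOfRAMallocationAndNUMPYcrunching,
                     [(1E6, 16), (1E6, 16)])
    t2 = time.perf_counter()
    print(f"2 calls via Pool(2) : {(t2 - t1) * 1e6:12.1f} [us]")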


How to avoid facing a final sarcasm of " A lousy bad deal, isn't it? "

Do a fair analysis, benchmark the hot-spots and scale well beyond schoolbook-example sizes of data before you spend your time and budget. "Just coding" does not work here.

Why?
A single "wrong" SLOC may devastate the resulting performance into more than about +37% longer time or may improve performance to spend less than -57% of the baseline processing time.

Pre-mature optimisations are awfully dangerous.

Costs/benefits analysis tells the facts before spending your expenses. Amdahl's law may help you decide on a breakeven point and also gives a principal limit, after which any number of free resources ( even infinitely many resources ( watch this fully interactive analysis and try to move the p-slider, for the [PARALLEL]-fraction of the processing, anywhere lower than the un-realistic 100% parallel-code, so as to smell the smoke of the real-life fire ) ) will not yield a bit of speedup for your code processing-flow.
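
For a feel of those numbers: the classic Amdahl speedup for a parallel fraction p on N processors is S = 1 / ( (1 - p) + p / N ). A tiny numeric sketch (with an assumed extra overhead term, expressed as a fraction of the serial run-time) shows how quickly it saturates:

def amdahl(p, N, overhead_fraction=0.0):
    """Speedup for parallel fraction p on N processors, with add-on overhead
    expressed as a fraction of the original (serial) run-time."""
    return 1.0 / ((1.0 - p) + p / N + overhead_fraction)

for N in (1, 2, 4, 8, 16, 1_000_000):
    print(f"N = {N:>7}  ideal: {amdahl(0.90, N):5.2f}x"
          f"   with 10% overhead: {amdahl(0.90, N, 0.10):5.2f}x")
# even with infinitely many CPUs, p = 0.90 caps the ideal speedup at 10x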

Smart vectorised tricks in performance-polished libraries like numpy, scipy et al, can and will internally use multiple CPU-cores, without python knowing or taking care about that. Learn vectorised-code tricks and your code will benefit a lot.
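
A small sketch of that vectorisation idea (array sizes and arithmetic are arbitrary): the same expression written as a Python-level loop and as a single numpy array expression, which the library evaluates in optimised compiled code:

import numpy as np

x = np.linspace(0.1, 10, 1_000_000)

# python-level loop: one interpreter step per element
loop_result = np.empty_like(x)
for i in range(x.size):
    loop_result[i] = -(2.0 / x[i]) * 0.5 - 0.25

# vectorised: a single numpy expression over the whole array
vec_result = -(2.0 / x) * 0.5 - 0.25

assert np.allclose(loop_result, vec_result)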

Also a numba LLVM compiler can help in cases, where ultimate performance ought be squeezed from your CPU-engine, where code cannot rely on use of the smart numpy performance tricks.
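
A minimal numba sketch, assuming the numba package is installed (the function body and sizes below are illustrative only): decorating a hot numeric loop with @njit compiles it outside the interpreter, with the first call paying the JIT-compilation cost:

import numpy as np
from numba import njit

@njit
def rhs_sum(theta, omega, x, n):
    # same arithmetic shape as the ODE right-hand side, evaluated many times
    total = 0.0
    for i in range(x.size):
        total += omega - (2.0 / x[i]) * omega - theta**n
    return total

x = np.linspace(0.1, 10, 1_000_000)
print(rhs_sum(1.0, 0.0, x, 1.0))   # first call includes the JIT compile cost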

Yet harder could be to go into other {pre|jit}-compiled-fashions of python-code, so as to escape from the trap of GIL-lock still-[SERIAL]-stepping of a code-execution.

Having as many CPU-cores as possible is always fine. Harnessing all such CPU-cores is easiest locally, inside a multiprocessor chip, harder in a NUMA-architecture fabric, and hardest in a distributed ecosystem of separate, loosely-coupled, at least connected computing nodes ( MPI and other forms of message-based coordination of otherwise autonomous computing nodes ).

Though the real costs of "getting 'em to indeed work for you" could be higher than the benefit of actually doing it ( re-factoring + debugging + proof-of-correctness + actual work + collecting of results ).

Murphy's Law is clear - if something can go wrong, it will go wrong at just the moment when it can cause the maximum harm.

:o) so be optimistic on the way forward - it will be a wild ride, I can promise you
