How many processes should I run in parallel?


Problem Description

I have a parallelized task that reads data from multiple files and writes the resulting information out to several files.

The idiom I am currently using to parallelize stuff:

import multiprocessing

listOfProcesses = []
for fileToBeRead in listOfFilesToBeRead:
    # Note the trailing comma: args must be a tuple, so a single
    # argument is written (fileToBeRead,) rather than (fileToBeRead)
    process = multiprocessing.Process(
        target=somethingThatReadsFromAFileAndWritesSomeStuffOut,
        args=(fileToBeRead,))
    process.start()
    listOfProcesses.append(process)

for process in listOfProcesses:
    process.join()

It is worth noting that somethingThatReadsFromAFileAndWritesSomeStuffOut might itself parallelize tasks (it may have to read from other files, etc. etc.).

Now, as you can see, the number of processes being created doesn't depend upon the number of cores I have on my computer, or anything else, except for how many tasks need to be completed. If ten tasks need to be run, create ten processes, and so on.

Is this the best way to create tasks? Should I instead think about how many cores my processor has, etc.?

Recommended Answer

Always separate the number of processes from the number of tasks. There's no reason why the two should be identical, and by making the number of processes a variable, you can experiment to see what works well for your particular problem. No theoretical answer is as good as old-fashioned get-your-hands-dirty benchmarking with real data.

Here's how you could do it using a multiprocessing Pool:

import multiprocessing as mp

num_workers = mp.cpu_count()  # one worker per CPU core; experiment with this value

pool = mp.Pool(num_workers)
for task in tasks:
    pool.apply_async(func, args=(task,))

pool.close()  # no more tasks will be submitted
pool.join()   # block until all workers have finished



pool = mp.Pool(num_workers) will create a pool of num_workers subprocesses. num_workers = mp.cpu_count() will set num_workers equal to the number of CPU cores. You can experiment by changing this number. (Note that pool = mp.Pool() creates a pool of N subprocesses, where N equals mp.cpu_count() by default.)
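As a concrete illustration of the defaults described above, here is a minimal sketch; `work` is a hypothetical stand-in for the real per-task function:

```python
import multiprocessing as mp

def work(task):
    # Hypothetical stand-in for the real per-task function
    return task * task

if __name__ == "__main__":
    # mp.Pool() with no argument defaults to mp.cpu_count() workers,
    # so these two lines are equivalent:
    #   pool = mp.Pool()
    #   pool = mp.Pool(mp.cpu_count())
    with mp.Pool() as pool:
        print(pool.map(work, range(4)))  # prints [0, 1, 4, 9]
```

The `if __name__ == "__main__":` guard matters on platforms that spawn rather than fork subprocesses; without it, each child would re-execute the pool-creating code.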

If a problem is CPU-bound, there is no benefit to setting num_workers to a number bigger than the number of cores, since the machine can't have more processes operating concurrently than the number of cores. Moreover, switching between the processes may make performance worse if num_workers exceeds the number of cores.

If a problem is IO-bound -- which yours might be since they are doing file IO -- it may make sense to have num_workers exceed the number of cores, if your IO device(s) can handle more concurrent tasks than you have cores. However, if your IO is sequential in nature -- if, for example, there is only one hard drive with only one read/write head -- then all but one of your subprocesses may be blocked waiting for the IO device. In this case no concurrency is possible and using multiprocessing in this case is likely to be slower than the equivalent sequential code.
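Since the only reliable answer is measurement, the benchmarking advice above can be sketched as follows. This is a minimal sketch: `process_one_file` is a hypothetical stand-in (replace it with the real read/write work), and the list of pool sizes to try is an arbitrary starting point:

```python
import multiprocessing as mp
import time

def process_one_file(path):
    # Hypothetical stand-in for the real per-file read/write work
    time.sleep(0.01)

def benchmark(num_workers, tasks):
    """Return the wall-clock time to run all tasks with a pool of num_workers."""
    start = time.perf_counter()
    with mp.Pool(num_workers) as pool:
        pool.map(process_one_file, tasks)
    return time.perf_counter() - start

if __name__ == "__main__":
    tasks = ["file%d.txt" % i for i in range(40)]
    # Try a range of pool sizes and keep whichever is fastest on real data
    for n in (1, 2, mp.cpu_count(), 2 * mp.cpu_count()):
        print("workers=%2d  elapsed=%.3fs" % (n, benchmark(n, tasks)))
```

Run this against your real files and worker function; the fastest `num_workers` on your hardware and data is the right one, whatever theory predicts.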
