Multiprocessing in python to speed up functions


Question

I am confused with Python multiprocessing.

I am trying to speed up a function that processes strings from a database, but I must have misunderstood how multiprocessing works, because the function takes longer when given to a pool of workers than with "normal" processing.

Here is an example of what I am trying to achieve.

from time import time
from multiprocessing import Pool, freeze_support

from random import choice


def foo(x):
    TupWerteMany = []
    for i in range(0, len(x)):
        TupWerte = []
        s = list(x[i][3])
        NewValue = choice(s) + choice(s) + choice(s) + choice(s)
        TupWerte.append(NewValue)
        TupWerte = tuple(TupWerte)
        TupWerteMany.append(TupWerte)
    return TupWerteMany



if __name__ == '__main__':
    start_time = time()
    List = [(u'1', u'aa', u'Jacob', u'Emily'),
            (u'2', u'bb', u'Ethan', u'Kayla')]
    List1 = List*1000000

    # METHOD 1 : NORMAL (takes 20 seconds)
    x2 = foo(List1)
    print x2[1:3]

    # METHOD 2 : APPLY_ASYNC (takes 28 seconds)
    # pool = Pool(4)
    # Werte = pool.apply_async(foo, args=(List1,))
    # x2 = Werte.get()
    # print '--------'
    # print x2[1:3]
    # print '--------'

    # METHOD 3 : MAP (!! DOES NOT WORK !!)
    # pool = Pool(4)
    # Werte = pool.map(foo, args=(List1,))  # map takes no args keyword
    # x2 = Werte.get()                      # map returns a list, not an AsyncResult
    # print '--------'
    # print x2[1:3]
    # print '--------'

    print 'Time Elapsed: ', time() - start_time

My questions:

  1. Why does apply_async take longer than the "normal" way?
  2. What am I doing wrong with map?
  3. Does it make sense to speed up such tasks with multiprocessing at all?
  4. Finally: after all I have read here, I am wondering whether multiprocessing in Python works on Windows at all?

Answer

So your first problem is that there is no actual parallelism happening in foo(x); you are passing the entire list to the function once.

1) The idea of a process pool is to have many processes doing computations on separate bits of some data.

# METHOD 2 : APPLY_ASYNC
jobs = 4
size = len(List1)
pool = Pool(4)
results = []
# split the list into 4 equally sized chunks and submit those to the pool
heads = range(size/jobs, size, size/jobs) + [size]
tails = range(0, size, size/jobs)
for tail, head in zip(tails, heads):
    werte = pool.apply_async(foo, args=(List1[tail:head],))
    results.append(werte)

pool.close()
pool.join()  # wait for the pool to be done

for result in results:
    werte = result.get()  # get the return value from the sub jobs

This will only give you an actual speedup if the time it takes to process each chunk is greater than the time it takes to launch the processes. That is the situation with four processes and four jobs to be done; the dynamics change, of course, if you have 4 processes and 100 jobs to do. Remember that you are creating a completely new Python interpreter four times, and that isn't free.
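To get a feeling for that cost, here is a minimal sketch (the noop function and the worker count of 4 are just illustrative, not part of the question) that times a pool doing essentially no work:

from time import time
from multiprocessing import Pool

def noop(x):
    # a task that does no real work, so any measured time is
    # pool startup, communication and teardown overhead
    return x

if __name__ == '__main__':
    start = time()
    pool = Pool(4)            # spawns four new Python interpreters
    pool.map(noop, range(4))  # one trivial task per worker
    pool.close()
    pool.join()
    print 'pool overhead: %.2f seconds' % (time() - start)

If that overhead is in the same ballpark as the time foo needs for a whole chunk, the parallel version cannot win.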

2) The problem you have with map is that it applies foo to EVERY element of List1 as a separate task, which takes quite a while. So if your pool has 4 processes, map will pop items off the list and send them to a process to be dealt with, wait for the processes to finish, pop some more items off the list, and so on. This makes sense only if processing a single item takes a long time, for instance if every item is a file name pointing to a one-gigabyte text file. As it stands, though, map will take a single item of the list and pass it to foo, whereas apply_async above is given a slice of the list. Try the following code:

def foo(thing):
    print thing

map(foo, ['a','b','c','d'])

That's the built-in Python map and it runs in a single process, but the idea is exactly the same for the multiprocess version.
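So for the multiprocess version the worker function has to accept a single record rather than the whole list. A minimal sketch of that rewrite (foo_single is a hypothetical name, not from the question):

from multiprocessing import Pool
from random import choice

def foo_single(record):
    # process exactly one tuple; pool.map distributes the
    # items of the list over the worker processes
    s = list(record[3])
    return (choice(s) + choice(s) + choice(s) + choice(s),)

if __name__ == '__main__':
    List1 = [(u'1', u'aa', u'Jacob', u'Emily'),
             (u'2', u'bb', u'Ethan', u'Kayla')] * 1000000
    pool = Pool(4)
    x2 = pool.map(foo_single, List1)  # no args= keyword, just the iterable
    pool.close()
    pool.join()
    print x2[1:3]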

Added as per J.F. Sebastian's comment: you can, however, use the chunksize argument to map to specify an approximate size for each chunk.

pool.map(foo, List1, chunksize=size/jobs) 
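Note that chunksize only controls how many items are handed to a worker at a time (reducing the inter-process communication overhead); the function is still called once per item, so it needs to accept a single record, as in the sketch above.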

I don't know, though, whether there is a problem with map on Windows, as I don't have a Windows machine available for testing.

3) Yes, given that your problem is big enough to justify forking out new Python interpreters.

4) Can't give you a definitive answer on that, as it depends on the number of cores/processors etc., but in general it should be fine on Windows.
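One Windows-specific detail worth spelling out (standard multiprocessing behaviour, not something from the original answer): Windows has no fork, so multiprocessing starts each worker by importing the main module afresh. That makes the if __name__ == '__main__': guard in the question's code mandatory on Windows, and freeze_support() is only needed when the script is frozen into a Windows executable.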
