How to Multi-thread an Operation Within a Loop in Python


Question

Say I have a very large list and I'm performing an operation like so:

for item in items:
    try:
        api.my_operation(item)
    except:
        print 'error with item'

My problem is two-fold:

  • There are a lot of items
  • api.my_operation takes forever to return

I'd like to use multi-threading to spin up a bunch of api.my_operation calls at once so I can process maybe 5 or 10 or even 100 items at a time.

If my_operation() raises an exception (because maybe I already processed that item), that's OK. It won't break anything. The loop can continue to the next item.

Note: this is for Python 2.7.3

Recommended answer

First, in Python, if your code is CPU-bound, multithreading won't help, because only one thread can hold the Global Interpreter Lock, and therefore run Python code, at a time. So, you need to use processes, not threads.

This is not true if your operation "takes forever to return" because it's IO-bound, that is, waiting on the network or disk copies or the like. I'll come back to that later.

Next, the way to process 5 or 10 or 100 items at once is to create a pool of 5 or 10 or 100 workers, and put the items into a queue that the workers service. Fortunately, the stdlib multiprocessing and concurrent.futures libraries both wrap up most of the details for you.

The former is more powerful and flexible for traditional programming; the latter is simpler if you need to compose future-waiting; for trivial cases, it really doesn't matter which you choose. (In this case, the most obvious implementation takes 3 lines with futures and 4 lines with multiprocessing.)

If you're using 2.6-2.7 or 3.0-3.1, futures isn't built in, but you can install it from PyPI (pip install futures).

Finally, it's usually a lot simpler to parallelize things if you can turn the entire loop iteration into a function call (something you could, e.g., pass to map), so let's do that first:

def try_my_operation(item):
    try:
        api.my_operation(item)
    except Exception:
        # A failure on one item (e.g. already processed) must not stop the rest
        print('error with item')


Putting it all together:

import concurrent.futures

# 10 worker processes; pick a count that suits your machine and workload
executor = concurrent.futures.ProcessPoolExecutor(10)
futures = [executor.submit(try_my_operation, item) for item in items]
concurrent.futures.wait(futures)
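
For comparison, the multiprocessing version of the same idea might look like the sketch below. It is only an illustration: `my_operation` is a hypothetical stand-in for `api.my_operation`, `run_with_pool` is a name invented here, and the success flag returned by `try_my_operation` is an addition so the caller can see which items failed.

```python
import multiprocessing


def my_operation(item):
    """Hypothetical stand-in for api.my_operation; rejects negative items."""
    if item < 0:
        raise ValueError("bad item: %r" % (item,))
    return item * 2


def try_my_operation(item):
    """Run one loop iteration; report a failure instead of raising."""
    try:
        my_operation(item)
        return True
    except Exception as exc:
        print("error with item %r: %s" % (item, exc))
        return False


def run_with_pool(items, workers=10):
    """Process items with a pool of worker processes; returns success flags."""
    pool = multiprocessing.Pool(workers)
    try:
        # map blocks until every item has been handled, preserving order
        return pool.map(try_my_operation, items)
    finally:
        pool.close()
        pool.join()
```

Call `run_with_pool(items)` from under an `if __name__ == "__main__":` guard, since on platforms that start workers by re-importing the main module an unguarded call would try to create the pool again in every worker.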


If you have lots of relatively small jobs, the overhead of multiprocessing might swamp the gains. The way to solve that is to batch up the work into larger jobs. For example (using grouper from the itertools recipes, which you can copy and paste into your code, or get from the more-itertools project on PyPI):

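import concurrent.futures

def try_multiple_operations(items):
    # Handle one whole batch inside a single worker call
    for item in items:
        try:
            api.my_operation(item)
        except Exception:
            print('error with item')

# grouper is the itertools recipe mentioned above; each batch holds 5 items
executor = concurrent.futures.ProcessPoolExecutor(10)
futures = [executor.submit(try_multiple_operations, group)
           for group in grouper(5, items)]
concurrent.futures.wait(futures)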
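
One detail to watch with the grouper recipe: it pads the last group with a fill value (None by default) when the number of items is not a multiple of the group size, so those padding values would also be passed to api.my_operation. If that matters, a small batching helper that leaves the last batch short is one alternative. The sketch below is an assumption of how that could look; `chunks_of` is a name made up for this example, not part of itertools.

```python
from itertools import islice


def chunks_of(iterable, size):
    """Yield lists of up to `size` items; the last list may be shorter."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return  # input exhausted, no padding added
        yield chunk


batches = list(chunks_of(range(7), 3))
print(batches)
```

Each yielded list can be submitted to the executor in place of a group produced by grouper.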


Finally, what if your code is IO-bound? Then threads are just as good as processes, and with less overhead (and fewer limitations, but those limitations usually won't affect you in cases like this). Sometimes that "less overhead" is enough to mean you don't need batching with threads, but you do with processes, which is a nice win.

So, how do you use threads instead of processes? Just change ProcessPoolExecutor to ThreadPoolExecutor.
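
As a rough, self-contained illustration of the thread-based variant, the sketch below also records which items failed instead of only printing a message. `my_operation` is a hypothetical stand-in for `api.my_operation`, `run_all` is a name invented for this example, and the items are assumed to be hashable because they are used as dictionary keys.

```python
import concurrent.futures


def my_operation(item):
    """Hypothetical stand-in for api.my_operation; fails on empty strings."""
    if not item:
        raise ValueError("empty item")
    return item.upper()


def run_all(items, workers=10):
    """Run my_operation on every item in a thread pool.

    Returns (results, failures): results maps item -> return value,
    failures maps item -> the exception it raised.
    """
    results, failures = {}, {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
        future_to_item = {executor.submit(my_operation, item): item
                          for item in items}
        # as_completed yields each future as soon as it finishes
        for future in concurrent.futures.as_completed(future_to_item):
            item = future_to_item[future]
            try:
                results[item] = future.result()
            except Exception as exc:
                # One failed item does not stop the others
                failures[item] = exc
    return results, failures


results, failures = run_all(["a", "", "b"], workers=2)
```

Because `future.result()` re-raises whatever the task raised, the try/except can live in the collecting loop rather than in a wrapper function.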

If you're not sure whether your code is CPU-bound or IO-bound, just try it both ways.
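
One way to try it both ways is to time the same workload under each executor class. The helper below is a minimal sketch under stated assumptions: `timed_run` and `fake_io` are names invented for this example, and `fake_io` merely sleeps to imitate a slow IO-bound call.

```python
import time
import concurrent.futures


def fake_io(item):
    """Hypothetical IO-bound task: sleeps briefly, then returns its input."""
    time.sleep(0.05)
    return item


def timed_run(executor_class, func, items, workers=10):
    """Run func over items with the given executor class.

    Returns (elapsed_seconds, results).
    """
    start = time.time()
    with executor_class(max_workers=workers) as executor:
        results = list(executor.map(func, items))
    return time.time() - start, results


elapsed, results = timed_run(concurrent.futures.ThreadPoolExecutor,
                             fake_io, range(10))
print("threads: %.2fs" % elapsed)
```

Passing `concurrent.futures.ProcessPoolExecutor` as the first argument (from under an `if __name__ == "__main__":` guard) gives the other measurement; whichever is faster for the real operation is the one to keep.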

Can I do this for multiple functions in my Python script? For example, if I had another for loop elsewhere in the code that I wanted to parallelize. Is it possible to do two multi-threaded functions in the same script?

Yes. In fact, there are two different ways to do it.

First, you can share the same (thread or process) executor and use it from multiple places with no problem. The whole point of tasks and futures is that they're self-contained; you don't care where they run, just that you queue them up and eventually get the answer back.
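
A minimal sketch of the shared-executor approach, assuming a thread pool and two unrelated loops (`square_all` and `lengths_of` are made-up names, and the built-ins `pow` and `len` stand in for real work):

```python
import concurrent.futures

# One executor for the whole program; 10 workers is an arbitrary choice.
executor = concurrent.futures.ThreadPoolExecutor(max_workers=10)


def square_all(numbers):
    """First loop: submits its tasks to the shared executor."""
    futures = [executor.submit(pow, n, 2) for n in numbers]
    return [f.result() for f in futures]


def lengths_of(words):
    """Second, unrelated loop: reuses the same executor."""
    futures = [executor.submit(len, w) for w in words]
    return [f.result() for f in futures]


squares = square_all([1, 2, 3])
lengths = lengths_of(["a", "bb"])

# Release the worker threads once nothing else will be submitted
executor.shutdown()
```

Neither function needs to know about the other; both simply queue work on the same pool and wait for their own futures.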

Alternatively, you can have two executors in the same program with no problem. This has a performance cost: if you're using both executors at the same time, you'll end up trying to run (for example) 16 busy threads on 8 cores, which means there's going to be some context switching. But sometimes it's worth doing because, say, the two executors are rarely busy at the same time, and it makes your code a lot simpler. Or maybe one executor is running very large tasks that can take a while to complete, and the other is running very small tasks that need to complete as quickly as possible, because responsiveness is more important than throughput for part of your program.

If you don't know which is appropriate for your program, usually it's the first.
