What are the differences between the threading and multiprocessing modules?

Question

I am learning how to use the threading and the multiprocessing modules in Python to run certain operations in parallel and speed up my code.

I am finding it hard to understand (maybe because I don't have any theoretical background about it) what the difference is between a threading.Thread() object and a multiprocessing.Process() one.

Also, it is not entirely clear to me how to instantiate a queue of jobs and have only 4 (for example) of them running in parallel, while the others wait for resources to free up before being executed.

I find the examples in the documentation clear, but not very exhaustive; as soon as I try to complicate things a bit, I receive a lot of weird errors (like a method that can't be pickled, and so on).

So, when should I use the threading and multiprocessing modules?

Can you link me to some resources that explain the concepts behind these two modules and how to use them properly for complex tasks?

Answer

What Giulio Franco says is true for multithreading vs. multiprocessing in general.

However, Python* has an added issue: There's a Global Interpreter Lock that prevents two threads in the same process from running Python code at the same time. This means that if you have 8 cores, and change your code to use 8 threads, it won't be able to use 800% CPU and run 8x faster; it'll use the same 100% CPU and run at the same speed. (In reality, it'll run a little slower, because there's extra overhead from threading, even if you don't have any shared data, but ignore that for now.)

There are exceptions to this. If your code's heavy computation doesn't actually happen in Python, but in some library with custom C code that does proper GIL handling, like a numpy app, you will get the expected performance benefit from threading. The same is true if the heavy computation is done by some subprocess that you run and wait on.

More importantly, there are cases where this doesn't matter. For example, a network server spends most of its time reading packets off the network, and a GUI app spends most of its time waiting for user events. One reason to use threads in a network server or GUI app is to allow you to do long-running "background tasks" without stopping the main thread from continuing to service network packets or GUI events. And that works just fine with Python threads. (In technical terms, this means Python threads give you concurrency, even though they don't give you core-parallelism.)
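As a rough illustration of that kind of concurrency, here is a minimal sketch in which a background thread sleeps (standing in for a slow I/O-bound job) while the main thread keeps handling "events". The function names and the timings are made up for the example; they are not from any real server or GUI framework.

```python
import threading
import time


def slow_background_task(results):
    # Stand-in for a long-running job such as a download or a report.
    # time.sleep releases the GIL, just as blocking I/O does.
    time.sleep(0.2)
    results.append("background done")


def serve_events(results, n_events=3):
    # Stand-in for the main loop of a network server or GUI app.
    worker = threading.Thread(target=slow_background_task, args=(results,))
    worker.start()
    for i in range(n_events):
        # The main thread stays responsive while the worker is busy.
        results.append(f"event {i} handled")
        time.sleep(0.01)
    worker.join()  # wait for the background job before returning
    return results


if __name__ == "__main__":
    print(serve_events([]))
```

The events are handled well before the background job finishes, even though only one thread runs Python code at any instant.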

But if you're writing a CPU-bound program in pure Python, using more threads is generally not helpful.
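If you want to see this for yourself, a sketch like the following times the same pure-Python busy loop run serially and run through a thread pool. The job sizes and helper names are arbitrary choices for the illustration, and it assumes a standard CPython build with the GIL enabled; the actual timings will vary by machine, so none are shown.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def count_down(n):
    # Pure-Python busy loop: the thread needs the GIL for every iteration.
    while n > 0:
        n -= 1
    return n


def run_serial(jobs):
    return [count_down(n) for n in jobs]


def run_threaded(jobs, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(count_down, jobs))


def timed(label, func):
    # Print how long func() takes and pass its result through.
    start = time.perf_counter()
    result = func()
    print(f"{label}: {time.perf_counter() - start:.3f}s")
    return result


if __name__ == "__main__":
    jobs = [2_000_000] * 4
    timed("serial  ", lambda: run_serial(jobs))
    timed("threaded", lambda: run_threaded(jobs))
```

On a GIL-enabled interpreter you should expect the two timings to be roughly the same, with the threaded run often slightly slower.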

Using separate processes has no such problems with the GIL, because each process has its own separate GIL. Of course you still have all the same tradeoffs between threads and processes as in any other language: it's more difficult and more expensive to share data between processes than between threads, it can be costly to run a huge number of processes or to create and destroy them frequently, etc. But the GIL weighs heavily on the balance toward processes, in a way that isn't true for, say, C or Java. So, you will find yourself using multiprocessing a lot more often in Python than you would in C or Java.

Meanwhile, Python's "batteries included" philosophy brings some good news: It's very easy to write code that can be switched back and forth between threads and processes with a one-liner change.

If you design your code in terms of self-contained "jobs" that don't share anything with other jobs (or the main program) except input and output, you can use the concurrent.futures library to write your code around a thread pool like this:

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    executor.submit(job, argument)
    executor.map(some_function, collection_of_independent_things)
    # ...

You can even get the results of those jobs and pass them on to further jobs, wait for things in order of execution or in order of completion, etc.; read the section on Future objects for details.
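For instance, a small pipeline along these lines collects results in completion order and then feeds them into a further job. The square and total functions are hypothetical placeholders for real work.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def square(x):
    return x * x


def total(values):
    return sum(values)


def run_pipeline(numbers, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # submit() returns a Future immediately; the job runs in the pool.
        futures = [executor.submit(square, n) for n in numbers]
        # as_completed() yields each Future as soon as it finishes,
        # which is not necessarily the order of submission.
        squares = [f.result() for f in as_completed(futures)]
        # Pass the collected results on to a further job.
        return executor.submit(total, squares).result()


if __name__ == "__main__":
    print(run_pipeline([1, 2, 3, 4]))
```

If you need results in input order instead, executor.map returns them in the order of the inputs.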

Now, if it turns out that your program is constantly using 100% CPU, and adding more threads just makes it slower, then you're running into the GIL problem, so you need to switch to processes. All you have to do is change that first line:

with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
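One way to keep that switch cheap is to pass the executor class in as a parameter, as in this sketch. Here cpu_job and run_jobs are illustrative names, not part of concurrent.futures; the job is defined at module level so that a process pool can find it.

```python
import concurrent.futures


def cpu_job(n):
    # Hypothetical CPU-bound job: sum of squares below n.
    return sum(i * i for i in range(n))


def run_jobs(executor_cls, inputs, workers=4):
    # Only the executor class differs between the thread and process versions.
    with executor_cls(max_workers=workers) as executor:
        return list(executor.map(cpu_job, inputs))


if __name__ == "__main__":
    # The __main__ guard matters for process pools on platforms that start
    # workers by re-importing this module (for example, Windows).
    inputs = [100_000] * 8
    threaded = run_jobs(concurrent.futures.ThreadPoolExecutor, inputs)
    multiproc = run_jobs(concurrent.futures.ProcessPoolExecutor, inputs)
    print(threaded == multiproc)
```

With max_workers=4, at most four jobs run at a time and the rest wait in the executor's internal queue.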

The only real caveat is that your jobs' arguments and return values have to be pickleable (and not take too much time or memory to pickle) to be usable cross-process. Usually this isn't a problem, but sometimes it is.
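A quick way to find out whether something will survive the trip is to try pickling it yourself before handing it to a process pool. The is_picklable helper below is a made-up convenience for this sketch, not a standard-library function.

```python
import pickle


def double(x):
    # Module-level functions are pickled by reference to their name,
    # so worker processes can look them up again.
    return 2 * x


def is_picklable(obj):
    # Try a round of pickling and report whether it worked.
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


if __name__ == "__main__":
    print("module-level function:", is_picklable(double))
    print("lambda:", is_picklable(lambda x: 2 * x))
    print("plain list:", is_picklable([1, 2, 3]))
```

Lambdas and functions defined inside other functions are among the things that typically fail this check, which is where many of the "can't be pickled" errors come from.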

But what if your jobs can't be self-contained? If you can design your code in terms of jobs that pass messages from one to another, it's still pretty easy. You may have to use threading.Thread or multiprocessing.Process instead of relying on pools. And you will have to create queue.Queue or multiprocessing.Queue objects explicitly. (There are plenty of other options—pipes, sockets, files with flocks, … but the point is, you have to do something manually if the automatic magic of an Executor is insufficient.)
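A minimal sketch of the manual version with threads might look like this: four workers pull jobs from one queue and push results onto another. The names STOP, worker, and run_workers are invented for the example, and squaring stands in for the real work.

```python
import queue
import threading

STOP = object()  # sentinel telling a worker to exit


def worker(jobs, results):
    while True:
        item = jobs.get()  # blocks until a job is available
        if item is STOP:
            break
        results.put((item, item * item))


def run_workers(items, n_workers=4):
    jobs, results = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(jobs, results))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for item in items:
        jobs.put(item)
    for _ in threads:
        jobs.put(STOP)  # one sentinel per worker
    for t in threads:
        t.join()
    # Every job produced exactly one result, so drain that many.
    return dict(results.get() for _ in items)


if __name__ == "__main__":
    print(run_workers(list(range(10))))
```

The same shape should carry over to multiprocessing.Process and multiprocessing.Queue, with one change: the sentinel has to survive pickling, so something like None is a safer choice than a bare object().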

But what if you can't even rely on message passing? What if you need two jobs to both mutate the same structure, and see each other's changes? In that case, you will need to do manual synchronization (locks, semaphores, conditions, etc.) and, if you want to use processes, explicit shared-memory objects to boot. This is when multithreading (or multiprocessing) gets difficult. If you can avoid it, great; if you can't, you will need to read more than someone can put into an SO answer.
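To give a flavor of the simplest case, here is a sketch of several threads updating one shared counter under a threading.Lock. The Counter class and the hammer function are illustrative only.

```python
import threading


class Counter:
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # The read-modify-write below is not atomic, so guard it.
        with self._lock:
            self.value += 1


def hammer(counter, n_threads=4, n_increments=10_000):
    def work():
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


if __name__ == "__main__":
    print(hammer(Counter()))
```

Without the lock, two threads can read the same old value and one update can be lost; real programs with several interacting locks get complicated quickly.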

From a comment, you wanted to know what's different between threads and processes in Python. Really, if you read Giulio Franco's answer and mine and all of our links, that should cover everything… but a summary would definitely be useful, so here goes:

  1. Threads share data by default; processes do not.
  2. As a consequence of (1), sending data between processes generally requires pickling and unpickling it.**
  3. As another consequence of (1), directly sharing data between processes generally requires putting it into low-level formats like Value, Array, and ctypes types.
  4. Processes are not subject to the GIL.
  5. On some platforms (mainly Windows), processes are much more expensive to create and destroy.
  6. There are some extra restrictions on processes, some of which are different on different platforms. See Programming guidelines for details.
  7. The threading module doesn't have some of the features of the multiprocessing module. (You can use multiprocessing.dummy to get most of the missing API on top of threads, or you can use higher-level modules like concurrent.futures and not worry about it.)
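As a rough sketch of point 3, this is what sharing a single integer between processes can look like with multiprocessing.Value. The add_many function and the counts are arbitrary choices for the illustration.

```python
import multiprocessing


def add_many(counter, n):
    for _ in range(n):
        # A Value created with the default lock exposes it via get_lock().
        with counter.get_lock():
            counter.value += 1


def main():
    # "i" is the type code for a C signed int stored in shared memory.
    counter = multiprocessing.Value("i", 0)
    procs = [
        multiprocessing.Process(target=add_many, args=(counter, 1000))
        for _ in range(4)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(counter.value)


if __name__ == "__main__":
    main()
```

Note how the shared object has to be a fixed, C-like type rather than an arbitrary Python object, which is the restriction point 3 describes.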


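* It's not actually Python, the language, that has this issue, but CPython, the "standard" implementation of that language. Some other implementations don't have a GIL, like Jython.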

** If you're using the fork start method for multiprocessing—which you can on most non-Windows platforms—each child process gets any resources the parent had when the child was started, which can be another way to pass data to children.
