Pandas, concurrent.futures and the GIL

Question

I'm writing code using Pandas 0.18/Python 3.5 on an Intel i3 (four cores).

I have read this: https://www.continuum.io/content/pandas-releasing-gil

I also have some work that is I/O bound (parsing CSV files into DataFrames), and I have to do a lot of calculation, mostly multiplying DataFrames.

My code is currently parallelized using concurrent.futures.ThreadPoolExecutor.

My question is:

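  • In general, should I use threads to run pandas jobs in parallel, or can pandas make effective use of all cores without my explicitly telling it to? (In that case I would run the jobs sequentially.)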

Recommended answer

As best I can tell from reading the docs, pandas simply releases the GIL for certain operations:

We are releasing the global-interpreter-lock (GIL) on some cython operations. This will allow other threads to run simultaneously during computation, potentially allowing performance improvements from multi-threading. Notably groupby, nsmallest, value_counts and some indexing operations benefit from this.

All this means is that the Python interpreter can execute other threads while the calculations being done by pandas continue. It doesn't mean that pandas automatically spreads a calculation across many threads. The docs more or less say this as well:

Releasing of the GIL could benefit an application that uses threads for user interactions (e.g. QT), or performing multi-threaded computations.

To get any parallelization benefit, you need to actually create and run multiple threads in your own code. So you should keep using the ThreadPoolExecutor if you want parallel execution in your application.
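As a rough illustration of creating the threads yourself, here is a minimal sketch that submits one groupby job per DataFrame to a ThreadPoolExecutor. The names summarize and make_frame and the columns "key" and "value" are hypothetical, and the frames are generated in memory as stand-ins for parsed CSV files. Whether this runs faster than a plain loop depends on the pandas version and on how much of the work happens with the GIL released.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def summarize(df):
    # groupby is one of the operations the docs list as benefiting
    # from the GIL release.
    return df.groupby("key")["value"].sum()


def make_frame(offset, rows=1000):
    # Hypothetical stand-in for a DataFrame parsed from a CSV file.
    return pd.DataFrame({
        "key": [i % 4 for i in range(rows)],
        "value": [i + offset for i in range(rows)],
    })


frames = [make_frame(offset) for offset in range(4)]

# pandas does not fan the work out by itself; each job runs in a
# thread that this code creates. executor.map keeps input order.
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(summarize, frames))
```

Each entry in results is the per-key sum for the frame at the same position in frames.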

Keep in mind that pandas only releases the GIL for some operations, so you may not see any performance improvement from multiple threads if you aren't calling methods that actually release it.
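One way to find out whether a particular job benefits is simply to time it both ways. The helper below is a sketch, not a rigorous benchmark: time_sequential_and_threaded and pure_python_job are hypothetical names, and the demo job is deliberately pure Python, which holds the GIL, so it is not expected to speed up with threads. Passing your own pandas function and inputs instead should show whether that operation gains anything.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def time_sequential_and_threaded(func, inputs, max_workers=4):
    """Return (sequential_seconds, threaded_seconds) for func over inputs."""
    start = time.perf_counter()
    for item in inputs:
        func(item)
    sequential_seconds = time.perf_counter() - start

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # list() forces every job to finish before the clock stops.
        list(executor.map(func, inputs))
    threaded_seconds = time.perf_counter() - start

    return sequential_seconds, threaded_seconds


def pure_python_job(n):
    # Runs entirely in the interpreter, so the GIL is held throughout.
    return sum(range(n))


seq_seconds, thr_seconds = time_sequential_and_threaded(
    pure_python_job, [200000] * 4
)
print("sequential: %.4fs, threaded: %.4fs" % (seq_seconds, thr_seconds))
```

If the threaded time is not clearly lower for your real workload, the operations involved probably do not release the GIL, and running the jobs sequentially costs nothing.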
