How do you parallelize apply() on Pandas Dataframes making use of all cores on one machine?


Question

As of August 2017, Pandas DataFrame.apply() is unfortunately still limited to a single core, meaning that a multi-core machine wastes the majority of its compute time when you run df.apply(myfunc, axis=1).

How can you use all your cores to run apply on a dataframe in parallel?

Recommended answer

You can use the swifter package:

pip install swifter

It works as a plugin for pandas, allowing you to reuse the apply function:

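import pandas as pd
import swifter  # importing swifter registers the .swifter accessor on pandas objects

def some_function(data):
    return data * 10

data = pd.DataFrame({'in': [1, 2, 3]})  # example input, not part of the original answer
data['out'] = data['in'].swifter.apply(some_function)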

It will automatically figure out the most efficient way to parallelize the function, whether or not the function is vectorized (as it is in the example above).

More examples and a performance comparison are available on GitHub. Note that the package is under active development, so the API may change.

Also note that this will not work automatically for string columns. When a column holds strings, swifter falls back to a "simple" pandas apply, which does not run in parallel. In this case, even forcing it to use dask will not improve performance, and you would be better off splitting your dataset manually and parallelizing with multiprocessing.
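For the string-column case, the manual approach could look roughly like the sketch below. It is not part of the original answer: parallel_apply, _apply_chunk and clean_text are hypothetical names chosen for illustration, and it assumes the function you apply is defined at module level so that multiprocessing can pickle it.

```python
import multiprocessing as mp

import numpy as np
import pandas as pd


def clean_text(value: str) -> str:
    # Hypothetical per-row string function, used only for illustration.
    return value.strip().lower()


def _apply_chunk(args):
    # Runs inside a worker process: an ordinary pandas apply on one chunk.
    chunk, func = args
    return chunk.apply(func)


def parallel_apply(series: pd.Series, func, n_workers=None) -> pd.Series:
    """Split `series` into contiguous chunks and apply `func` to each chunk
    in a separate process, then stitch the results back together."""
    n_workers = n_workers or mp.cpu_count()
    # Never start more workers than there are rows (and always at least one).
    n_workers = max(1, min(n_workers, len(series)))
    # Chunk boundaries by position; the original index is kept on each chunk.
    bounds = np.linspace(0, len(series), n_workers + 1, dtype=int)
    chunks = [series.iloc[bounds[i]:bounds[i + 1]] for i in range(n_workers)]
    with mp.Pool(n_workers) as pool:
        # pool.map returns results in the same order as the input chunks.
        results = pool.map(_apply_chunk, [(chunk, func) for chunk in chunks])
    return pd.concat(results)


if __name__ == "__main__":
    data = pd.DataFrame({"in": ["  Foo ", "BAR", " baz"] * 1000})
    data["out"] = parallel_apply(data["in"], clean_text)
    print(data.head())
```

Each chunk is pickled and sent to a worker, so this only pays off when the per-row function is expensive compared with the cost of copying the data. The `if __name__ == "__main__":` guard is needed on platforms where worker processes are started by re-importing the script, such as Windows.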
