Make Pandas DataFrame apply() use all cores?

Question

As of August 2017, Pandas DataFrame.apply() is still limited to a single core, so a multi-core machine wastes most of its available compute time when you run df.apply(myfunc, axis=1).

How can you use all your cores to run apply on a DataFrame in parallel?

Recommended answer

You can use the swifter package:

pip install swifter

It works as a plugin for pandas, allowing you to reuse the apply function:

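import swifter  # importing swifter adds the .swifter accessor to pandas objects

def some_function(data):
    return data * 10

# data is assumed to be an existing DataFrame with a numeric column 'in'
data['out'] = data['in'].swifter.apply(some_function)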

It will automatically figure out the most efficient way to parallelize the function, whether or not it is vectorized (as it is in the example above).
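To make the "vectorized or not" distinction concrete, here is a small pandas-only sketch. The function names times_ten and clip_negative are made up for illustration, and the sketch only shows what the distinction means; it does not claim to reproduce how swifter itself decides.

```python
import pandas as pd


def times_ten(data):
    # Pure arithmetic: works on a scalar or on a whole Series at once.
    return data * 10


def clip_negative(value):
    # Branching on the value only works one element at a time.
    return 0 if value < 0 else value * 10


s = pd.Series([-1, 2, 3])

# Vectorized: calling the function on the whole Series matches apply().
assert times_ten(s).equals(s.apply(times_ten))

# Not vectorized: the whole-Series call raises ValueError,
# so the function has to go through apply() element by element.
try:
    clip_negative(s)
    whole_series_ok = True
except ValueError:
    whole_series_ok = False

per_element = s.apply(clip_negative)
```

A function like times_ten can be called once on the entire column, which is usually far faster than any per-row parallelism. A function like clip_negative has to be called per element, which is the case where spreading the work across cores can help.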

More examples and a performance comparison are available on GitHub. Note that the package is under active development, so the API may change.

Also note that this will not work automatically for string columns. When working with strings, swifter falls back to a plain Pandas apply, which does not run in parallel. In this case, even forcing it to use dask will not improve performance, and you would be better off splitting your dataset manually and parallelizing with multiprocessing.
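The manual approach mentioned above could look roughly like the following. This is a minimal sketch, not code from the answer: shout, apply_to_chunk and parallel_apply are hypothetical names, and it assumes the per-element function is defined at module level so that worker processes can import it.

```python
import multiprocessing as mp

import numpy as np
import pandas as pd


def shout(text):
    # Hypothetical per-element function; stands in for your own string logic.
    return text.upper() + "!"


def apply_to_chunk(chunk):
    # Runs inside a worker process on one slice of the Series.
    return chunk.apply(shout)


def parallel_apply(series, n_workers=None):
    # Split the Series into roughly equal slices, one per worker.
    n_workers = n_workers or mp.cpu_count()
    bounds = np.linspace(0, len(series), n_workers + 1, dtype=int)
    chunks = [series.iloc[start:stop]
              for start, stop in zip(bounds[:-1], bounds[1:])]
    with mp.Pool(n_workers) as pool:
        results = pool.map(apply_to_chunk, chunks)
    # Each slice keeps its original index, so concatenation restores the order.
    return pd.concat(results)


if __name__ == "__main__":
    df = pd.DataFrame({"in": ["a", "b", "c", "d", "e"]})
    df["out"] = parallel_apply(df["in"], n_workers=2)
    print(df)
```

Each chunk is pickled and sent to a worker, and the results are pickled on the way back, so this only pays off when the per-element work is expensive compared with that copying. The `if __name__ == "__main__":` guard matters on platforms where multiprocessing starts workers by re-importing the script.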
