How do you parallelize apply() on Pandas DataFrames making use of all cores on one machine?
Question
As of August 2017, Pandas DataFrame.apply() is unfortunately still limited to a single core, meaning that a multi-core machine will waste the majority of its compute time when you run df.apply(myfunc, axis=1).
How can you use all your cores to run apply on a dataframe in parallel?
Recommended answer
You may use the swifter package:
pip install swifter
It works as a plugin for pandas, allowing you to reuse the apply function:
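import pandas as pd
import swifter  # importing registers the .swifter accessor on pandas objects
data = pd.DataFrame({'in': range(1000)})  # example input (assumed); any numeric column works

def some_function(values):
    return values * 10

data['out'] = data['in'].swifter.apply(some_function)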
It will automatically figure out the most efficient way to parallelize the function, whether or not the function is vectorized (the one in the example above is).
More examples and a performance comparison are available on GitHub. Note that the package is under active development, so the API may change.
Also note that this does not work automatically for string columns. When a column contains strings, swifter falls back to a "simple" Pandas apply, which does not run in parallel. In this case, even forcing it to use dask will not improve performance, and you would be better off splitting your dataset manually and parallelizing with multiprocessing.
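The manual approach mentioned above could look roughly like the following. This is only a sketch, not part of swifter: the names myfunc, apply_to_chunk and parallel_apply, and the column names "a" and "b", are made up for illustration. It assumes a non-empty DataFrame and a row function defined at module level so that worker processes can pickle it.

```python
import multiprocessing as mp

import numpy as np
import pandas as pd


def myfunc(row):
    # Hypothetical row-wise function; replace with your own logic.
    return row["a"] + row["b"]


def apply_to_chunk(chunk):
    # Runs inside a worker process: an ordinary single-core apply on one slice.
    return chunk.apply(myfunc, axis=1)


def parallel_apply(df, n_workers=None):
    # Default to one worker per core.
    n_workers = n_workers or mp.cpu_count()
    # Split by row position into contiguous slices, skipping empty ones.
    bounds = np.linspace(0, len(df), n_workers + 1, dtype=int)
    chunks = [df.iloc[lo:hi] for lo, hi in zip(bounds[:-1], bounds[1:]) if hi > lo]
    with mp.Pool(n_workers) as pool:
        results = pool.map(apply_to_chunk, chunks)
    # Each chunk keeps its original index, so concat restores the row order.
    return pd.concat(results)


if __name__ == "__main__":
    df = pd.DataFrame({"a": range(1000), "b": range(1000)})
    df["out"] = parallel_apply(df)
```

The `if __name__ == "__main__":` guard matters on platforms where worker processes are spawned rather than forked, since each worker re-imports the script. Each chunk has to be pickled and sent to a worker, so this only pays off when myfunc is expensive relative to that transfer cost.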