Large Pandas Dataframe parallel processing


Question

I am accessing a very large Pandas dataframe as a global variable. This variable is accessed in parallel via joblib.

For example:

from joblib import Parallel, delayed

df = db.query("select id, a_lot_of_data from table")

def process(id):
    temp_df = df.loc[id]
    temp_df.apply(another_function)

# Each worker accesses the global df
Parallel(n_jobs=8)(delayed(process)(id) for id in df['id'].to_list())

Accessing the original df in this manner seems to copy the data across processes. This is unexpected, since the original df isn't being altered in any of the subprocesses (or is it?).

Answer

The entire DataFrame needs to be pickled and unpickled for each process created by joblib. In practice, this is very slow, and each process also requires many times the memory.
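The copy-per-process behaviour can be seen directly with the `pickle` module, which is roughly what joblib does under the hood when shipping the frame to a worker (the small DataFrame below is a hypothetical stand-in for the query result):

```python
import pickle

import pandas as pd

# Hypothetical stand-in for the large query result.
df = pd.DataFrame({"id": range(1000), "a_lot_of_data": 1.0})

# This is essentially what happens for every worker process:
payload = pickle.dumps(df)    # serialise the whole frame
copy = pickle.loads(payload)  # the worker deserialises its own copy

assert copy.equals(df)        # same contents ...
assert copy is not df         # ... but a separate object in memory

# Mutating the worker's copy never touches the parent's frame,
# which is why the question's observation holds even though
# nothing "alters" the original df.
copy["a_lot_of_data"] = -1.0
```

So the subprocesses really do each pay the full serialisation and memory cost, whether or not they modify anything.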

One solution is to store your data in HDF (df.to_hdf) using the table format. You can then use select to select subsets of data for further processing. In practice this will be too slow for interactive use. It is also very complex, and your workers will need to store their work so that it can be consolidated in the final step.
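A minimal sketch of that pattern, assuming the optional PyTables (`tables`) package is installed and using a small hypothetical frame in place of the query result (`format='table'` plus `data_columns` is what makes `where` queries possible):

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the query result.
df = pd.DataFrame({"id": [1, 1, 2, 2], "value": [1.0, 2.0, 3.0, 4.0]})

path = os.path.join(tempfile.mkdtemp(), "data.h5")

# format='table' makes the store queryable; data_columns enables
# `where` filtering on the 'id' column.
df.to_hdf(path, key="df", format="table", data_columns=["id"])

# A worker can now pull only the rows it needs instead of
# unpickling the whole frame.
subset = pd.read_hdf(path, key="df", where="id == 2")
```

Each worker would read its own subset like this and write its result back out for a final consolidation step.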

An alternative would be to explore numba.vectorize with target='parallel'. This would require the use of NumPy arrays instead of Pandas objects, so it also has some complexity costs.

In the long run, dask is hoped to bring parallel execution to Pandas, but this is not something to expect soon.

