Slow speed while parallelizing operation on pandas dataframe


Problem description

I have a dataframe on which I perform some operations and print out the results. To do this, I have to iterate over each row.

for count, row in final_df.iterrows():
    x = row['param_a']
    y = row['param_b']
    # Perform operation
    # Write to output file

I decided to parallelize this using the Python multiprocessing module:

import multiprocessing

def write_site_files(row):
    x = row['param_a']
    y = row['param_b']
    # Perform operation
    # Write to output file

num_proc = 4                    # number of concurrent worker processes
pkg_num = 0
total_runs = final_df.shape[0]  # total number of rows in final_df
threads = []

while pkg_num < total_runs or len(threads):
    if len(threads) < num_proc and pkg_num < total_runs:
        print(pkg_num, total_runs)
        t = multiprocessing.Process(target=write_site_files,
                                    args=[final_df.iloc[pkg_num]])
        pkg_num = pkg_num + 1
        t.start()
        threads.append(t)
    else:
        # Reap finished processes (iterate over a copy so removal is safe)
        for thread in threads[:]:
            if not thread.is_alive():
                threads.remove(thread)

However, the latter (parallelized) method is much slower than the simple iteration-based approach. Is there anything I am missing?

Thanks!

Answer

This will be far less efficient than doing it in a single process, unless the actual operation takes a long time, e.g. seconds per row.

Normally, parallelization is the last tool in the box: profile first, then vectorize locally, then optimize locally, and only then parallelize.
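For instance, if the per-row operation can be expressed as column arithmetic, vectorizing it often removes the need to parallelize at all. A minimal sketch, reusing the question's `param_a`/`param_b` column names and assuming (hypothetically) that the operation is a simple product:

```python
import pandas as pd

final_df = pd.DataFrame({'param_a': [1.0, 2.0, 3.0],
                         'param_b': [10.0, 20.0, 30.0]})

# Row-by-row (slow): collect row['param_a'] * row['param_b'] via iterrows()
# Vectorized (fast): one expression operating on entire columns at once
result = final_df['param_a'] * final_df['param_b']
print(result.tolist())  # → [10.0, 40.0, 90.0]
```

The vectorized version dispatches the whole computation to compiled code in one call instead of paying Python-level overhead per row.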

You are spending time just doing the slicing, then spinning up new processes (which is generally a constant overhead per process), then pickling a single row (it is not clear from your example how big the rows are).

At the very least, you should chunk the rows, e.g. df.iloc[i*chunksize:(i+1)*chunksize].

There will hopefully be some support for parallel apply in 0.14; see here: https://github.com/pydata/pandas/issues/5751
