Slow speed while parallelizing operation on pandas dataframe
Question
I have a dataframe which I perform some operation on and print out. To do this, I have to iterate through each row.
for count, row in final_df.iterrows():
    x = row['param_a']
    y = row['param_b']
    # Perform operation
    # Write to output file
I decided to parallelize this using the python multiprocessing module
import multiprocessing

def write_site_files(row, pkg_num):
    x = row['param_a']
    y = row['param_b']
    # Perform operation
    # Write to output file

pkg_num = 0
total_runs = final_df.shape[0]  # Total number of rows in final_df
threads = []  # Running worker processes (not threads, despite the name)
# num_proc (maximum number of concurrent processes) is defined elsewhere
while pkg_num < total_runs or len(threads):
    if len(threads) < num_proc and pkg_num < total_runs:
        print(pkg_num, total_runs)
        # One new process per row
        t = multiprocessing.Process(target=write_site_files,
                                    args=[final_df.iloc[pkg_num], pkg_num])
        pkg_num = pkg_num + 1
        t.start()
        threads.append(t)
    else:
        # Drop finished processes; rebuild the list instead of removing while iterating
        threads = [thread for thread in threads if thread.is_alive()]
However, the latter (parallelized) method is way slower than the simple iteration based approach. Is there anything I am missing?
Thanks!
Recommended answer
This will be far less efficient than doing the work in a single process unless the actual operation takes a long time, on the order of seconds per row.
Normally parallelization is the last tool in the box. After profiling, after local vectorization, after local optimization, then you parallelize.
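As a rough illustration of what "local vectorization" could look like before reaching for multiprocessing, here is a minimal sketch. The question does not say what the per-row operation is, so the addition of `param_a` and `param_b` below is an assumption made purely for illustration, as is the small stand-in `final_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for final_df; the real data is not shown in the question.
final_df = pd.DataFrame({"param_a": np.arange(5.0), "param_b": np.arange(5.0) * 2})

def row_by_row(df):
    # Baseline: one Python-level iteration per row, as in the question.
    out = []
    for _, row in df.iterrows():
        out.append(row["param_a"] + row["param_b"])  # assumed operation
    return pd.Series(out, index=df.index)

def vectorized(df):
    # Same assumed operation applied to whole columns at once.
    return df["param_a"] + df["param_b"]

result = vectorized(final_df)
```

If the real operation can be expressed on whole columns like this, it may remove the need for parallelization altogether; if it cannot, profiling the loop body is still the next step.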
You are spending time just doing the slicing, then spinning up new processes (which is generally a constant overhead), then pickling a single row (not clear how big it is from your example).
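One rough way to get a feel for the serialization cost mentioned above is to compare the pickled size of a single row with the per-row share of a larger slice. This is only a sketch: the stand-in `final_df` is hypothetical, and whether arguments are actually pickled depends on how the worker processes are started.

```python
import pickle

import numpy as np
import pandas as pd

# Hypothetical stand-in for final_df; the real data is not shown in the question.
final_df = pd.DataFrame({"param_a": np.arange(1000.0), "param_b": np.arange(1000.0)})

# One row, serialized on its own as a Series.
row_bytes = len(pickle.dumps(final_df.iloc[0]))

# The same rows serialized together as one DataFrame slice.
chunk = final_df.iloc[0:1000]
chunk_bytes = len(pickle.dumps(chunk))

# Compare the per-row cost of the two approaches.
print(row_bytes, chunk_bytes / len(chunk))
```

A single-row Series carries its own index and metadata, so sending rows one at a time tends to repeat that fixed cost for every row.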
At the very least, you should chunk the rows, e.g. df.iloc[i*chunksize:(i+1)*chunksize].
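A minimal sketch of the chunking idea, assuming a `multiprocessing.Pool` is acceptable in place of hand-managed `Process` objects. The helper names (`process_chunk`, `split_into_chunks`, `run_parallel`) and the per-row operation are hypothetical; the question does not show the real operation.

```python
import multiprocessing

import numpy as np
import pandas as pd

def process_chunk(chunk):
    # Hypothetical stand-in for "perform operation": add the two columns per row.
    return (chunk["param_a"] + chunk["param_b"]).tolist()

def split_into_chunks(df, chunksize):
    # Consecutive, non-overlapping row slices of at most `chunksize` rows.
    return [df.iloc[start:start + chunksize] for start in range(0, len(df), chunksize)]

def run_parallel(df, chunksize, num_proc):
    chunks = split_into_chunks(df, chunksize)
    # A fixed pool of workers is reused across chunks instead of one process per row.
    with multiprocessing.Pool(processes=num_proc) as pool:
        per_chunk = pool.map(process_chunk, chunks)
    # Flatten the per-chunk results back into one list in row order.
    return [value for part in per_chunk for value in part]

if __name__ == "__main__":
    # Hypothetical stand-in for final_df.
    final_df = pd.DataFrame({"param_a": np.arange(10.0), "param_b": np.ones(10)})
    print(run_parallel(final_df, chunksize=4, num_proc=2))
```

With this shape, process start-up and serialization are paid once per chunk rather than once per row. If each worker writes its own output file, as in the question, `process_chunk` could do the writing and return nothing.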
Hopefully there will be some support for a parallel apply in pandas 0.14; see here: https://github.com/pydata/pandas/issues/5751