Parallel apply function on a df in Python
Question
I have a function that goes over two lists: items and dates. The function returns an updated list of items. For now it runs with apply, which is not that efficient on millions of rows. I want to make it more efficient by parallelizing it.
The items in the item list are in chronological order, as are the dates in the corresponding date list (item_list and date_list are the same size).
This is the df:
Date      item_list   date_list
12/05/20  [I1,I3,I4]  [10/05/20, 11/05/20, 12/05/20]
11/05/20  [I1,I3]     [11/05/20, 14/05/20]
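To reproduce the problem, the sample frame above can be rebuilt roughly as follows. This is only a sketch: the post does not show the column dtypes, so storing the dates as plain strings is an assumption.

```python
import pandas as pd

# Assumption: dates are stored as plain strings; the post does not show dtypes.
df = pd.DataFrame({
    "Date": ["12/05/20", "11/05/20"],
    "item_list": [["I1", "I3", "I4"], ["I1", "I3"]],
    "date_list": [
        ["10/05/20", "11/05/20", "12/05/20"],
        ["11/05/20", "14/05/20"],
    ],
})

# item_list and date_list should have the same length in every row.
same_length = (df["item_list"].map(len) == df["date_list"].map(len)).all()
```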
This is the df I want:
Date      item_list   date_list                       items_list_per_date
12/05/20  [I1,I3,I4]  [10/05/20, 11/05/20, 12/05/20]  [I1,I3]
11/05/20  [I1,I3]     [11/05/20, 14/05/20]            nan
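This is my code:
import numpy as np
import pandas as pd
from tqdm import tqdm

tqdm.pandas()  # registers DataFrame.progress_apply

def get_item_list_per_date(date, items_list, date_list):
    # Rows with a missing list hold NaN instead of a list.
    if str(items_list) == "nan" or str(date_list) == "nan":
        return np.nan
    new_date_list = []
    for d in list(date_list):
        new_date_list.append(pd.to_datetime(d))
    # The row's date must already be a Timestamp for this membership test to match.
    if (date in new_date_list) and (len(new_date_list) > 1):
        loc = new_date_list.index(date)
    else:
        return np.nan
    # Keep only the items that come before the row's date.
    updated_items_list = items_list[:loc]
    if len(updated_items_list) == 0:
        return np.nan
    return updated_items_list

# Column names follow the table above (Date, item_list, date_list).
df['items_list_per_date'] = df.progress_apply(
    lambda x: get_item_list_per_date(date=x['Date'], items_list=x['item_list'], date_list=x['date_list']),
    axis=1)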
I would love to parallelize it if possible. Can you help?
Recommended answer
Use:
import multiprocessing as mp

import numpy as np
import pandas as pd

def fx(df):
    # Runs in a worker process on one chunk of the frame.
    def __fx(s):
        date = s['Date']
        date_list = s['date_list']
        if date in date_list:
            loc = date_list.index(date)
            # Items that come before the row's date; NaN when there are none.
            return s['item_list'][:loc] or np.nan
        else:
            return np.nan
    return df.apply(__fx, axis=1)

def parallel_apply(df):
    # Split into one chunk per CPU core and drop any empty chunks.
    dfs = filter(lambda d: not d.empty, np.array_split(df, mp.cpu_count()))
    pool = mp.Pool()
    per_date = pd.concat(pool.map(fx, dfs))
    pool.close()
    pool.join()
    return per_date

df['items_list_per_date'] = parallel_apply(df)
Result:
#print(df)
Date      item_list   date_list                       items_list_per_date
12/05/20  [I1,I3,I4]  [10/05/20, 11/05/20, 12/05/20]  [I1,I3]
11/05/20  [I1,I3]     [11/05/20, 14/05/20]            nan
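One practical note on the snippet above: on platforms where multiprocessing starts workers with spawn (Windows, and macOS by default), the worker function has to live at module top level and the pool has to be created from code guarded by `if __name__ == "__main__":`. The following is a minimal sketch of the same chunk-and-map idea under those constraints, not a drop-in replacement. The names `items_before_date`, `apply_chunk`, and `parallel_items_per_date` are hypothetical, the dates are assumed to be stored as plain strings, and rows with a missing list are assumed to hold NaN.

```python
import multiprocessing as mp

import numpy as np
import pandas as pd


def items_before_date(date, item_list, date_list):
    """Return the items that precede `date` in `date_list`, or NaN if none."""
    # Assumption: a missing list shows up as float NaN rather than a list.
    if not isinstance(item_list, list) or not isinstance(date_list, list):
        return np.nan
    if date not in date_list:
        return np.nan
    earlier = item_list[:date_list.index(date)]
    return earlier if earlier else np.nan


def apply_chunk(chunk):
    """Worker: compute the new column for one slice of the frame."""
    values = [
        items_before_date(d, items, dates)
        for d, items, dates in zip(chunk["Date"], chunk["item_list"], chunk["date_list"])
    ]
    # Keep the original index so the result aligns on assignment.
    return pd.Series(values, index=chunk.index, dtype=object)


def parallel_items_per_date(df, processes=None):
    """Split `df` into row slices and process them in a worker pool."""
    n = processes or mp.cpu_count()
    step = max(1, -(-len(df) // n))  # ceiling division, at least one row per chunk
    chunks = [df.iloc[i:i + step] for i in range(0, len(df), step)]
    if not chunks:
        return pd.Series([], index=df.index, dtype=object)
    with mp.Pool(n) as pool:
        return pd.concat(pool.map(apply_chunk, chunks))
```

In a script, the call would sit under the guard, for example `if __name__ == "__main__": df["items_list_per_date"] = parallel_items_per_date(df)`. Iterating over the zipped columns instead of calling `df.apply(..., axis=1)` inside each worker avoids building a Series per row, which tends to matter on frames with millions of rows. Whether the pool pays off at all depends on how the per-row work compares with the cost of pickling the chunks, so it is worth timing both variants on real data.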