Parallel apply function on df in Python

Problem description

I have a function that goes over two lists: items and dates. The function returns an updated list of items. For now it runs with apply, which is not efficient on millions of rows. I want to make it more efficient by parallelizing it.

The items in the item list are in chronological order, as are the dates in the corresponding date list (item_list and date_list are the same size).

This is the df:

Date        item_list     date_list
12/05/20    [I1,I3,I4]    [10/05/20, 11/05/20, 12/05/20]
11/05/20    [I1,I3]       [11/05/20, 14/05/20]
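
To reproduce the problem, the sample above can be rebuilt as a small DataFrame. This is only a sketch: it assumes the dates are stored as plain strings and the list columns hold Python lists, which the post does not state explicitly.

```python
import pandas as pd

# Sample rows copied from the table above.
# Assumption: dates are kept as strings rather than parsed timestamps.
df = pd.DataFrame({
    "Date": ["12/05/20", "11/05/20"],
    "item_list": [["I1", "I3", "I4"], ["I1", "I3"]],
    "date_list": [
        ["10/05/20", "11/05/20", "12/05/20"],
        ["11/05/20", "14/05/20"],
    ],
})
print(df)
```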

This is the df I want:

Date        item_list     date_list                         items_list_per_date
12/05/20    [I1,I3,I4]    [10/05/20, 11/05/20, 12/05/20]    [I1,I3]
11/05/20    [I1,I3]       [11/05/20, 14/05/20]              nan

This is my code:

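import numpy as np
import pandas as pd
from tqdm import tqdm

tqdm.pandas()  # registers DataFrame.progress_apply

def get_item_list_per_date(date, items_list, date_list):
    # A missing list arrives as a NaN scalar, whose string form is "nan".
    if str(items_list) == "nan" or str(date_list) == "nan":
        return np.nan

    new_date_list = [pd.to_datetime(d) for d in date_list]

    # The row's date must appear in a list that holds more than one date.
    if date not in new_date_list or len(new_date_list) <= 1:
        return np.nan
    loc = new_date_list.index(date)

    # Keep only the items that precede the row's date.
    updated_items_list = items_list[:loc]
    if len(updated_items_list) == 0:
        return np.nan

    return updated_items_list

# Column names follow the df shown above ('Date', 'item_list', 'date_list').
df['items_list_per_date'] = df.progress_apply(lambda x: get_item_list_per_date(date=x['Date'], items_list=x['item_list'], date_list=x['date_list']), axis=1)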

I would love to parallelize it if possible. Can you help?

Recommended answer

Use:

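import multiprocessing as mp

import numpy as np
import pandas as pd

def fx(df):
    # Runs in a worker process on one chunk of the DataFrame.
    def __fx(s):
        date = s['Date']
        date_list = s['date_list']
        if date in date_list:
            loc = date_list.index(date)
            items = s['item_list'][:loc]
            # Return NaN rather than an empty list when no item precedes the date.
            return items if len(items) > 0 else np.nan
        return np.nan

    return df.apply(__fx, axis=1)

def parallel_apply(df):
    # One chunk per CPU core; drop empty chunks when df has fewer rows than cores.
    dfs = [d for d in np.array_split(df, mp.cpu_count()) if not d.empty]
    # The pool is shut down automatically once map() has returned.
    with mp.Pool() as pool:
        per_date = pd.concat(pool.map(fx, dfs))
    return per_date

if __name__ == '__main__':  # needed on platforms that spawn worker processes
    df['items_list_per_date'] = parallel_apply(df)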
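
The same idea can be sketched as a self-contained module. This is one possible arrangement, not the answer's exact code: the names `items_before_date`, `process_chunk`, `split_frame` and the `n_workers` parameter are hypothetical, chunks are cut with `iloc` instead of `np.array_split`, and the list columns are assumed to hold Python lists of date strings.

```python
import multiprocessing as mp

import numpy as np
import pandas as pd


def items_before_date(row):
    """Return the items listed before the row's own date, or NaN if there are none."""
    date, date_list = row["Date"], row["date_list"]
    if date not in date_list:
        return np.nan
    items = row["item_list"][: date_list.index(date)]
    return items if items else np.nan


def process_chunk(chunk):
    """Worker entry point: plain row-wise apply on one chunk."""
    return chunk.apply(items_before_date, axis=1)


def split_frame(df, n_chunks):
    """Split df into at most n_chunks contiguous, non-empty pieces using iloc."""
    n_chunks = max(1, min(n_chunks, len(df)))
    bounds = np.linspace(0, len(df), n_chunks + 1).astype(int)
    return [df.iloc[a:b] for a, b in zip(bounds[:-1], bounds[1:]) if b > a]


def parallel_apply(df, n_workers=None):
    """Apply items_before_date to every row, spreading chunks over processes."""
    n_workers = n_workers or mp.cpu_count()
    chunks = split_frame(df, n_workers)
    if n_workers == 1 or len(chunks) <= 1:
        # Nothing to parallelize; avoid the cost of starting a pool.
        return process_chunk(df)
    with mp.Pool(n_workers) as pool:
        # Chunks keep their original index, so concat restores row alignment.
        return pd.concat(pool.map(process_chunk, chunks))
```

When more than one worker is used, call `parallel_apply(df)` from under an `if __name__ == "__main__":` guard, since platforms that spawn worker processes re-import the main module. For small frames the cost of starting processes and pickling chunks can outweigh any gain, so it is worth timing against the serial `apply` first.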

Result:

#print(df)

Date        item_list     date_list                         items_list_per_date
12/05/20    [I1,I3,I4]    [10/05/20, 11/05/20, 12/05/20]    [I1,I3]
11/05/20    [I1,I3]       [11/05/20, 14/05/20]              nan
