How to efficiently combine pandas dataframes


Question

  • I have 2 dataframes, df_oth and df_small.
  • The "ID" column uniquely identifies each row in df_oth.
  • In df_small, on the other hand, each ID may appear multiple times.

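I am trying to

  • extract the contents of some columns from df_small for each ID,
  • convert them to lists and wrap them in a dictionary,
  • and finally store the result in a new column on df_oth, in the row with the corresponding ID.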
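
To make the target shape concrete, here is a minimal sketch of the value built for a single ID. The sample data and the helper name build_value are hypothetical; only the names ID, C1, C2, C1_Arr and C2_Arr come from the code in the question.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df_small = pd.DataFrame({'ID': [1, 1, 2],
                         'C1': [10, 11, 20],
                         'C2': ['x', 'y', 'z']})

def build_value(frame):
    """Wrap the C1 and C2 values of one ID's rows in a dict of lists."""
    return {'C1_Arr': frame['C1'].tolist(), 'C2_Arr': frame['C2'].tolist()}

# The dict that should end up in the new column of df_oth for ID 1
value_for_id_1 = build_value(df_small[df_small['ID'] == 1])
print(value_for_id_1)
```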

In my first iteration, I was assigning df_row directly to the corresponding cell on df_oth, but this was too slow. I then modified the code as below to store the combined values in a temporary dataframe and push them to df_oth at the end. It got a bit quicker, but each batch of 1K groups still takes roughly 4 s, and I have around 1M unique IDs. I'd really appreciate some pointers on how to do this faster. Parallelisation or another library such as Dask isn't an option, so I have to stick to pandas.

import time
import pandas as pd

sum_t1, sum_t2 = 0, 0
ratio = 1000
# The original set_index()/sort_index() calls discarded their return values,
# so they had no effect and are left out here.
df_temp = pd.DataFrame(columns=['ID', 'newcol'])
grps = df_small.groupby('ID')
idx = 0
for grp, frame in grps:
    s1 = time.time()
    idx += 1
    id_no = frame.iloc[0, frame.columns.get_loc('ID')]
    # One-row frame holding the dict for this ID
    df_row = pd.DataFrame({'ID': id_no,
                           'newcol': [
                                       {'C1_Arr': frame['C1'],
                                        'C2_Arr': frame['C2']}
                                     ]})
    s2 = time.time()
    # Slow part: the frame grows by one row per iteration.
    # (DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent.)
    df_temp = pd.concat([df_temp, df_row], ignore_index=True)
    t1, t2 = (s2 - s1), (time.time() - s2)
    sum_t1 += t1
    sum_t2 += t2
    if idx % ratio == 0:
        # Time spent building rows vs. appending them over the last batch
        print(f'{idx}: {id_no} - {sum_t1} - {sum_t2} - {sum_t1 / sum_t2}')
        sum_t1, sum_t2 = 0, 0
df_oth = pd.merge(df_oth, df_temp, on='ID')

Recommended answer

OK, it took a lot of trial and error, but here are the lessons learnt:

  • Instead of pushing rows to a dataframe on each iteration, add the rows to a list and build the dataframe in one go right at the end.
  • Looking columns up by label inside the loop is quite expensive, so instead of extracting each column via frame['Cx'], a single positional slice such as iloc[:, n:] works much better. The column order can be changed right before the loop so that the desired columns sit together on one side.
  • Operations like reset_index and drop on the group element on each iteration are expensive too.
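
The column reordering mentioned in the second point could look roughly like the sketch below. The sample data, the extra column and the variable names wanted, others, n and tail are hypothetical; the idea is only to move the desired columns to the right-hand side once, so that one positional slice picks them all up.

```python
import pandas as pd

# Hypothetical sample data with the wanted columns scattered among others
df_small = pd.DataFrame({'C1': [10, 11, 20],
                         'ID': [1, 1, 2],
                         'extra': [0, 0, 0],
                         'C2': ['x', 'y', 'z']})

wanted = ['C1', 'C2']
others = [c for c in df_small.columns if c not in wanted]

# Reorder once, before any loop: unwanted columns first, wanted ones last
df_small = df_small[others + wanted]
n = len(others)

# A single positional slice now selects every wanted column
tail = df_small.iloc[:, n:]
```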

I don't have the full stats, but after these simple modifications the running time went from a projected 1+ hours to about 2 minutes.

import pandas as pd

temp_lst = []
grps = df_small.groupby('ID')
for grp_name, frame in grps:
    # iloc[:, 1:] assumes 'ID' is the first column of df_small;
    # each row of the group becomes one dict in the list
    temp_lst.append({'ID': grp_name, 'newcol':
        list(frame.iloc[:, 1:].T.to_dict().values())})
# Build the temporary frame once, after the loop
df_tmp = pd.DataFrame(temp_lst, columns=['ID', 'newcol'])
df_oth = df_oth.merge(df_tmp, how='left', on='ID')
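
As a possible alternative that is not part of the original answer, the Python-level loop over groups can be avoided by aggregating each column into lists with groupby, which produces the dict-of-lists shape described in the question. This is a minimal sketch on hypothetical sample data; it stays within pandas, but its speed on 1M IDs has not been measured here.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df_oth = pd.DataFrame({'ID': [1, 2, 3], 'other': ['a', 'b', 'c']})
df_small = pd.DataFrame({'ID': [1, 1, 2],
                         'C1': [10, 11, 20],
                         'C2': ['x', 'y', 'z']})

# One list per ID and per column, computed in a single groupby pass
lists = df_small.groupby('ID')[['C1', 'C2']].agg(list)

# Wrap each ID's two lists in a dict and build the temporary frame once
df_tmp = pd.DataFrame({
    'ID': lists.index,
    'newcol': [{'C1_Arr': c1, 'C2_Arr': c2}
               for c1, c2 in zip(lists['C1'], lists['C2'])],
})

# Left join keeps IDs of df_oth that never occur in df_small (newcol is NaN)
result = df_oth.merge(df_tmp, how='left', on='ID')
```

Here ID 3 has no rows in df_small, so its newcol entry is left as NaN; whether that or an empty dict is preferable depends on how the column is consumed later.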
