Pandas: efficient way to combine dataframes


Problem description

I'm looking for a more efficient way than pd.concat to combine two pandas DataFrames.

I have a large DataFrame (~7GB in size) with the columns "A", "B", "C" and "D". I want to group the frame by "A", then within each group: group by "B", average "C" and sum "D", and finally combine all the results into one DataFrame. I've tried the following approaches:

1) Creating an empty final DataFrame, iterating over the groupby of "A", doing the processing I need, and then appending each group to the final DataFrame with pd.concat. The problem is that pd.concat is extremely slow.

2) Iterating over the groupby of "A", doing the processing I need, and then saving the result to a CSV file. That works OK, but I want to find out whether there is a more efficient way that doesn't involve all the I/O of writing to disk.

Code examples

First approach - final DataFrame with pd.concat:

import numpy as np
import pandas as pd

# DELIMITER is a constant defined elsewhere in the original code.
def pivot_frame(in_df_path):
    in_df = pd.read_csv(in_df_path, delimiter=DELIMITER)
    res_cols = in_df.columns.tolist()
    res = pd.DataFrame(columns=res_cols)
    # Group by a plain column name so each key is a scalar, not a 1-tuple.
    g = in_df.groupby("A")
    for title, group in g:
        temp = group.groupby("B").agg({"C": np.mean, "D": np.sum})
        temp = temp.reset_index()
        temp.insert(0, "A", title)
        # Slow: every call copies everything accumulated so far.
        res = pd.concat([res, temp], ignore_index=True)
    return res
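
The slowdown in the first approach comes largely from calling pd.concat inside the loop, since each call copies everything accumulated so far. A common workaround, sketched here under the assumption that the per-group results fit in memory, is to collect the pieces in a list and concatenate once at the end. The function name `pivot_frame_collect` and the sample data are illustrative and not part of the original post.

```python
import pandas as pd


def pivot_frame_collect(in_df):
    """Aggregate per "A" group, collecting pieces and concatenating once."""
    pieces = []
    for title, group in in_df.groupby("A"):
        temp = group.groupby("B").agg({"C": "mean", "D": "sum"}).reset_index()
        temp.insert(0, "A", title)
        pieces.append(temp)
    # A single concat copies each piece once instead of re-copying the
    # accumulated result on every iteration.
    return pd.concat(pieces, ignore_index=True)


# Small made-up frame, for illustration only.
sample = pd.DataFrame({
    "A": ["x", "x", "x", "y"],
    "B": [1, 1, 2, 1],
    "C": [1.0, 3.0, 5.0, 7.0],
    "D": [10, 20, 30, 40],
})
collected = pivot_frame_collect(sample)
print(collected)
```

This still loops in Python over every "A" group, so it only removes the repeated copying, not the per-group overhead.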

Second approach - writing to disk:

import csv
import numpy as np
import pandas as pd

def pivot_frame(in_df_path, output_path):
    in_df = pd.read_csv(in_df_path, delimiter=DELIMITER)
    # newline='' lets the csv module control line endings.
    with open(output_path, 'w', newline='') as f:
        csv_writer = csv.writer(f, delimiter=DELIMITER)
        csv_writer.writerow(["A", "B", "C", "D"])
        g = in_df.groupby("A")
        for title, group in g:
            temp = group.groupby("B").agg({"C": np.mean, "D": np.sum})
            temp = temp.reset_index()
            temp.insert(0, "A", title)
            # index=False keeps the row index out of the file so rows match the header.
            temp.to_csv(f, header=False, index=False, sep=DELIMITER)

The second approach is much faster than the first, but I'm looking for something that would spare me the disk access altogether. I read about split-apply-combine (e.g. https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html) but I haven't found it helpful.

Thanks a lot! :)

Recommended answer

Solved

Niels Henkens' comment really helped. The solution is simply to group by both columns at once:

result = in_df.groupby(by=["A","B"]).agg({"C": np.mean, "D": np.sum})
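
As a quick sanity check on a small made-up frame, the sketch below compares the single two-key groupby with the nested loop from the question. It uses the string aggregation names "mean" and "sum" in place of np.mean and np.sum, and calls reset_index() so the group keys come back as ordinary columns. The sample data and the variable name `nested` are assumptions for illustration.

```python
import pandas as pd

# Small made-up frame, for illustration only.
in_df = pd.DataFrame({
    "A": ["x", "x", "x", "y"],
    "B": [1, 1, 2, 1],
    "C": [1.0, 3.0, 5.0, 7.0],
    "D": [10, 20, 30, 40],
})

# One pass: group by both keys at once.
result = in_df.groupby(["A", "B"]).agg({"C": "mean", "D": "sum"}).reset_index()

# Nested version, equivalent to the loop in the question.
nested = pd.concat(
    [
        grp.groupby("B").agg({"C": "mean", "D": "sum"}).reset_index().assign(A=key)
        for key, grp in in_df.groupby("A")
    ],
    ignore_index=True,
)[["A", "B", "C", "D"]]

# Compare row by row as plain Python values.
print(result.values.tolist() == nested.values.tolist())
```

Because the grouping and aggregation happen in a single call, there is no Python-level loop and no intermediate frames to concatenate.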

Another performance improvement is to use Dask:

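import dask.dataframe as dd

# PATH_TO_FILE and DELIMITER are constants defined elsewhere in the original code.
df = dd.read_csv(PATH_TO_FILE, delimiter=DELIMITER)
# String aggregation names avoid needing a NumPy import here.
# compute() runs the lazy graph and returns a regular pandas DataFrame.
g = df.groupby(by=["A", "B"]).agg({"C": "mean", "D": "sum"}).compute().reset_index()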
