Pandas: efficient way to combine dataframes

Views: 61 Posted: 2021/6/15 19:54:44 Tags: python pandas performance split-apply-combine

Problem description

I'm looking for a more efficient way than pd.concat to combine two pandas DataFrames.

I have a large DataFrame (~7 GB) with the columns "A", "B", "C" and "D". I want to group the frame by "A" and then, within each group, group by "B", average "C" and sum "D", and finally combine all the results into one DataFrame. I've tried the following approaches:

1) Creating an empty final DataFrame, iterating over the groupby of "A", doing the processing I need, and then appending each group's result to the final DataFrame with pd.concat. The problem is that pd.concat is extremely slow.

2) Iterating over the groupby of "A", doing the processing I need, and then saving each result to a CSV file. That works fine, but I want to find out whether there is a more efficient way that avoids all the I/O of writing to disk.
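A likely reason approach 1 is slow is that calling pd.concat inside the loop copies the whole accumulated result on every iteration. A minimal sketch of the usual workaround, assuming the per-group results fit in memory: collect them in a list and concatenate once at the end. The function name `aggregate_per_group` and the toy data are hypothetical and not part of the original post.

```python
import pandas as pd


def aggregate_per_group(in_df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate each "A" group by "B", then concatenate a single time."""
    parts = []  # per-group results; appending to a list is cheap
    for title, group in in_df.groupby("A"):
        temp = group.groupby("B").agg({"C": "mean", "D": "sum"}).reset_index()
        temp.insert(0, "A", title)
        parts.append(temp)
    # One concat call copies the data once instead of once per group.
    return pd.concat(parts, ignore_index=True)


# Hypothetical toy data, for illustration only.
toy = pd.DataFrame({
    "A": ["x", "x", "x", "y"],
    "B": [1, 1, 2, 1],
    "C": [1.0, 3.0, 5.0, 7.0],
    "D": [10, 20, 30, 40],
})
combined = aggregate_per_group(toy)
```

Note that pd.concat raises a ValueError on an empty list, so an empty input frame would need separate handling.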

Code examples

First approach - Final DataFrame with pd.concat:

import pandas as pd

DELIMITER = ","  # assumed value; the original post does not show this constant

def pivot_frame(in_df_path):
    in_df = pd.read_csv(in_df_path, delimiter=DELIMITER)
    res_cols = in_df.columns.tolist()
    res = pd.DataFrame(columns=res_cols)
    # Group by the column name (not a one-element list) so `title` is a scalar.
    g = in_df.groupby("A")
    for title, group in g:
        temp = group.groupby("B").agg({"C": "mean", "D": "sum"})
        temp = temp.reset_index()
        temp.insert(0, "A", title)
        # Slow step: each call copies everything accumulated so far.
        res = pd.concat([res, temp], ignore_index=True)
    return res

Second approach - Writing to disk:

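import csv
import pandas as pd

DELIMITER = ","  # assumed value; the original post does not show this constant

def pivot_frame(in_df_path, output_path):
    in_df = pd.read_csv(in_df_path, delimiter=DELIMITER)
    # newline="" is what the csv module recommends for files it writes.
    with open(output_path, "w", newline="") as f:
        csv_writer = csv.writer(f, delimiter=DELIMITER)
        csv_writer.writerow(["A", "B", "C", "D"])
        g = in_df.groupby("A")
        for title, group in g:
            temp = group.groupby("B").agg({"C": "mean", "D": "sum"})
            temp = temp.reset_index()
            temp.insert(0, "A", title)
            # index=False keeps each row aligned with the four-column header.
            temp.to_csv(f, header=False, index=False, sep=DELIMITER)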

The second approach is much faster than the first, but I'm looking for something that avoids going to disk every time. I read about split-apply-combine (e.g. https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html), but I didn't find it helpful.
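For reference, grouping by "A" and then by "B" inside each group produces the same groups as grouping by the pair ("A", "B") in one step, so the explicit loop may not be needed. The following is a minimal sketch under that assumption, using pandas named aggregation. The function name `aggregate_by_a_and_b` and the toy data are hypothetical, and whether this is fast enough on a ~7 GB frame has not been measured here.

```python
import pandas as pd


def aggregate_by_a_and_b(in_df: pd.DataFrame) -> pd.DataFrame:
    """Mean of "C" and sum of "D" for every ("A", "B") pair, in one pass."""
    return in_df.groupby(["A", "B"], as_index=False).agg(
        C=("C", "mean"),  # output column C = mean of input column C
        D=("D", "sum"),   # output column D = sum of input column D
    )


# Hypothetical toy data, for illustration only.
toy = pd.DataFrame({
    "A": ["x", "x", "x", "y"],
    "B": [1, 1, 2, 1],
    "C": [1.0, 3.0, 5.0, 7.0],
    "D": [10, 20, 30, 40],
})
result = aggregate_by_a_and_b(toy)
```

With as_index=False the group keys come back as ordinary columns, so the result has the same "A", "B", "C", "D" layout as the loop-based versions, with no intermediate frames to concatenate and no disk I/O.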

Thanks a lot! :)
