Pandas: efficient way to combine dataframes
Question
I'm looking for a more efficient way than pd.concat to combine two pandas DataFrames.
I have a large DataFrame (~7 GB) with the columns "A", "B", "C" and "D". I want to group the frame by "A" and then, within each group, group by "B", average "C" and sum "D", and finally combine all the results into one DataFrame. I've tried the following approaches:
1) Creating an empty final DataFrame, iterating over the groupby of "A", doing the processing I need, and then appending each group to the final DataFrame with pd.concat. The problem is that pd.concat is extremely slow.
2) Iterating over the groupby of "A", doing the processing I need, and then saving the result to a CSV file. That works OK, but I want to find out whether there is a more efficient way that doesn't involve all the I/O of writing to disk.
Code examples
First approach - Final DataFrame with pd.concat:
import pandas as pd

DELIMITER = ","  # assumed; the original snippet does not define it

def pivot_frame(in_df_path):
    in_df = pd.read_csv(in_df_path, delimiter=DELIMITER)
    res = pd.DataFrame(columns=in_df.columns.tolist())
    # Group by the scalar key "A" so that title is a plain value, not a tuple
    for title, group in in_df.groupby("A"):
        temp = group.groupby("B").agg({"C": "mean", "D": "sum"})
        temp = temp.reset_index()
        temp.insert(0, "A", title)
        # Each concat copies everything accumulated so far, which is why this is slow
        res = pd.concat([res, temp], ignore_index=True)
    return res
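For reference, much of the slowdown in the first approach comes from calling pd.concat inside the loop, since every call copies all rows accumulated so far. A minimal sketch of the usual workaround, assuming a per-group loop is really needed, is to collect the pieces in a list and concatenate once at the end. The function name aggregate_per_group and the demo data are hypothetical, not from the original post.

```python
import pandas as pd

def aggregate_per_group(in_df):
    """Aggregate each "A" group separately, then combine with one concat."""
    pieces = []
    for title, group in in_df.groupby("A"):
        temp = group.groupby("B").agg({"C": "mean", "D": "sum"}).reset_index()
        temp.insert(0, "A", title)
        pieces.append(temp)  # appending to a list is cheap; no copying yet
    # A single concat copies each piece exactly once
    return pd.concat(pieces, ignore_index=True)

# Hypothetical demo data with the same column names as in the question
demo = pd.DataFrame({
    "A": ["x", "x", "x", "y"],
    "B": ["p", "p", "q", "p"],
    "C": [1.0, 3.0, 5.0, 7.0],
    "D": [10, 20, 30, 40],
})
combined = aggregate_per_group(demo)
```

Note that pd.concat raises a ValueError on an empty list, so an empty input frame would need separate handling.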
Second approach - Writing to disk:
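import csv
import pandas as pd

DELIMITER = ","  # assumed; the original snippet does not define it

def pivot_frame(in_df_path, output_path):
    in_df = pd.read_csv(in_df_path, delimiter=DELIMITER)
    # newline="" is what the csv module recommends for files it writes
    with open(output_path, "w", newline="") as f:
        csv_writer = csv.writer(f, delimiter=DELIMITER)
        csv_writer.writerow(["A", "B", "C", "D"])
        for title, group in in_df.groupby("A"):
            temp = group.groupby("B").agg({"C": "mean", "D": "sum"})
            temp = temp.reset_index()
            temp.insert(0, "A", title)
            # index=False keeps the row index out of the file so rows match the header
            temp.to_csv(f, header=False, index=False, sep=DELIMITER)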
The second approach is much faster than the first one, but I'm looking for something that would spare me the constant disk access. I read about split-apply-combine (e.g. https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html) but haven't found it helpful.
Thanks a lot! :)
Recommended answer
Solved
So Niels Henkens' comment really helped, and the solution is simply to group by both columns at once:
result = in_df.groupby(["A", "B"]).agg({"C": "mean", "D": "sum"})
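To make the one-liner concrete, here is a small worked sketch on made-up data (the values are hypothetical; only the column names come from the question). It shows that the result is indexed by the (A, B) pairs and that reset_index() turns those keys back into ordinary columns.

```python
import pandas as pd

# Hypothetical sample data using the question's column names
in_df = pd.DataFrame({
    "A": ["x", "x", "x", "y"],
    "B": ["p", "p", "q", "p"],
    "C": [1.0, 3.0, 5.0, 7.0],
    "D": [10, 20, 30, 40],
})

# One pass over the data: mean of C and sum of D for every (A, B) pair
result = in_df.groupby(["A", "B"]).agg({"C": "mean", "D": "sum"})

# The group keys end up in a MultiIndex; reset_index() makes them columns again
flat = result.reset_index()
print(flat)
```

If the order of the groups does not matter, passing sort=False to groupby skips sorting the group keys, which may save some time on a large frame.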
Another improvement in performance is to use Dask -
import dask.dataframe as dd

df = dd.read_csv(PATH_TO_FILE, delimiter=DELIMITER)
# compute() runs the lazy task graph and returns a regular pandas DataFrame
g = df.groupby(["A", "B"]).agg({"C": "mean", "D": "sum"}).compute().reset_index()
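If adding Dask is not an option, a similar idea can be sketched in plain pandas by reading the CSV in chunks. This is only an illustration under assumptions: the helper name aggregate_in_chunks, the intermediate column names and the sample text are hypothetical. The one subtle point is that a mean cannot be combined across chunks directly, so the sketch keeps a sum and a count per chunk and divides at the end.

```python
import io
import pandas as pd

def aggregate_in_chunks(source, chunksize, sep=","):
    """Mean of C and sum of D per (A, B), reading the CSV piece by piece."""
    partials = []
    for chunk in pd.read_csv(source, sep=sep, chunksize=chunksize):
        # Keep sum and count of C: averaging per-chunk means would be wrong
        partials.append(
            chunk.groupby(["A", "B"]).agg(
                c_sum=("C", "sum"),
                c_count=("C", "count"),
                d_sum=("D", "sum"),
            )
        )
    # Combine the small per-chunk summaries with one concat and one groupby
    totals = pd.concat(partials).groupby(level=["A", "B"]).sum()
    return pd.DataFrame({
        "C": totals["c_sum"] / totals["c_count"],
        "D": totals["d_sum"],
    }).reset_index()

# Hypothetical in-memory CSV; chunksize=1 forces a group to span several chunks
csv_text = "A,B,C,D\nx,p,1.0,10\nx,p,3.0,20\nx,q,5.0,30\ny,p,7.0,40\n"
out = aggregate_in_chunks(io.StringIO(csv_text), chunksize=1)
```

In practice the chunk size would be much larger, chosen so that one chunk fits comfortably in memory.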