Does groupby in pandas create a copy of the data or just a view?


Question

Does pandas.DataFrame.groupby create a copy of the data or just a view? In the (more probable) case of not creating a copy, what is the additional memory overhead, and how does it scale with the original dataframe's characteristics (e.g. number of rows, columns, distinct groups)?

Answer

The groupby code in pandas gets a bit complex, so it's hard to find out from first principles. A quick test makes it seem like memory use grows as the data grows and that more groups = more memory, but it doesn't appear to be making a full copy or anything:

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: %load_ext memory_profiler

In [7]: df = pd.DataFrame(np.random.random((1000,5)))

In [8]: def ret_df(df):
   ...:     return df

In [9]: def ret_gb_df(df):
   ...:     return df, df.groupby(0).mean()

In [10]: %memit ret_df(df)
peak memory: 75.91 MiB, increment: 0.00 MiB

In [11]: %memit ret_gb_df(df)
peak memory: 75.96 MiB, increment: 0.05 MiB

In [12]: df = pd.DataFrame(np.random.random((100000,5)))

In [13]: %memit ret_df(df)
peak memory: 79.76 MiB, increment: -0.02 MiB

In [14]: %memit ret_gb_df(df)
peak memory: 94.88 MiB, increment: 15.12 MiB

In [15]: df = pd.DataFrame(np.random.random((1000000,5)))

In [16]: %memit ret_df(df)
peak memory: 113.98 MiB, increment: 0.01 MiB

In [17]: %memit ret_gb_df(df)
peak memory: 263.14 MiB, increment: 149.16 MiB

In [18]: df = pd.DataFrame(np.random.choice([0,1,2,3], (1000000, 5)))

In [19]: %memit ret_df(df)
peak memory: 95.34 MiB, increment: 0.00 MiB

In [20]: %memit ret_gb_df(df)
peak memory: 166.91 MiB, increment: 71.56 MiB

