Pandas GroupBy memory deallocation
Problem description
I noticed that memory allocated while iterating through a Pandas GroupBy object is not deallocated after iteration. I use resource.getrusage(resource.RUSAGE_SELF).ru_maxrss (see the second answer in this post for details) to measure the total amount of active memory used by the Python process.
import resource
import gc
import pandas as pd
import numpy as np
i = np.random.choice(list(range(100)), 4000)
cols = list(range(int(2e4)))
df = pd.DataFrame(1, index=i, columns=cols)
gb = df.groupby(level=0)
# gb = list(gb)
for i in range(3):
    print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1e6)
    for idx, x in enumerate(gb):
        if idx == 0:
            print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1e6)
    # del idx, x
    # gc.collect()
This prints the following total active memory (in GB):
0.671732
1.297424
1.297952
1.923288
1.923288
2.548624
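The measurement used above can be wrapped in a small helper. This is only a sketch: peak_rss_gb is a hypothetical name, not something from the question, and the unit handling assumes Linux (kilobytes) or macOS (bytes). Note that ru_maxrss is a high-water mark, so it can show whether peak memory keeps growing, but it never goes down when memory is released.

```python
import resource
import sys


def peak_rss_gb():
    """Return the peak resident set size of this process in gigabytes.

    Hypothetical helper for illustration; not part of the original question.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Linux reports ru_maxrss in kilobytes; macOS reports it in bytes.
    bytes_per_unit = 1 if sys.platform == "darwin" else 1024
    return peak * bytes_per_unit / 1e9


before = peak_rss_gb()
buffer = bytearray(50 * 1024 * 1024)  # request roughly 50 MB
after_alloc = peak_rss_gb()
del buffer
after_free = peak_rss_gb()

# The peak can only stay flat or rise, even after `del buffer`.
print(before, after_alloc, after_free)
```

Because the value is a peak, a flat reading across iterations (as in the outputs below) means later iterations reused memory instead of adding to it.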
Solutions
Uncommenting del idx, x and gc.collect() fixes the problem. However, I have to del every variable that references the DataFrames returned by iterating over the groupby (which can be a pain depending on the code in the inner for loop). The printed memory usage becomes:
0.671768
1.297412
1.297992
1.297992
1.297992
1.297992
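Part of why del matters here is that Python loop variables stay bound after a for loop ends, so the last object yielded remains referenced. The sketch below shows this with plain objects instead of DataFrames; Chunk and make_chunks are hypothetical stand-ins, and the immediate release after del relies on CPython's reference counting.

```python
import weakref


class Chunk:
    """Hypothetical stand-in for a DataFrame yielded by the groupby."""


def make_chunks(n):
    # Yields fresh objects, loosely mimicking iteration over a GroupBy.
    for _ in range(n):
        yield Chunk()


last_seen = None
for idx, x in enumerate(make_chunks(3)):
    last_seen = weakref.ref(x)  # weak reference: does not keep x alive

# The loop is over, but `x` is still bound to the final chunk.
alive_after_loop = last_seen() is not None

del idx, x
# On CPython the chunk is freed as soon as its last reference goes away.
alive_after_del = last_seen() is not None

print(alive_after_loop, alive_after_del)
```

Any other name that still points at a yielded object would keep it alive in the same way, which matches the observation that every such variable has to be deleted.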
Alternatively, I can uncomment gb = list(gb). The resulting memory usage is roughly the same as with the previous solution:
1.32874
1.32874
1.32874
1.32874
1.32874
1.32874
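Questions
- Why is the memory for the DataFrames produced by iterating over the groupby not deallocated after iteration completes?
- Is there a better solution than the two above? If not, which of these two solutions is "better"?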
Recommended answer
Memory Weirdness
This is very interesting! You do not need del idx, x. Using only gc.collect() kept memory constant for me. This is much cleaner than having the del statements inside the loop.
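One possible reason gc.collect() alone can be enough is that objects caught in reference cycles are not freed by reference counting and wait for the cyclic collector. Whether pandas creates such cycles during groupby iteration is an assumption here, not something the answer states. The sketch below only demonstrates the general mechanism, using a hypothetical Node class.

```python
import gc
import weakref


class Node:
    """Hypothetical object that can point at another object."""

    def __init__(self):
        self.other = None


gc.disable()  # keep the automatic collector out of the way for the demo
try:
    a, b = Node(), Node()
    a.other, b.other = b, a  # reference cycle: a -> b -> a
    probe = weakref.ref(a)

    del a, b
    # The names are gone, but the cycle keeps both reference counts above zero.
    alive_before_collect = probe() is not None

    gc.collect()  # the cyclic collector finds and frees the unreachable pair
    alive_after_collect = probe() is not None
finally:
    gc.enable()

print(alive_before_collect, alive_after_collect)
```

If something similar happens inside the loop in the question, calling gc.collect() once per outer iteration would release that garbage before the next pass allocates more.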