How do I release memory used by a pandas dataframe?


Problem Description


I have a really large csv file that I opened in pandas as follows....

import pandas
df = pandas.read_csv('large_txt_file.txt')

Once I do this my memory usage increases by 2GB, which is expected because this file contains millions of rows. My problem comes when I need to release this memory. I ran....

del df

However, my memory usage did not drop. Is this the wrong approach to release memory used by a pandas data frame? If it is, what is the proper way?

Solution

Reducing memory usage in Python is difficult, because Python does not actually release memory back to the operating system. If you delete objects, then the memory is available to new Python objects, but not free()'d back to the system (see this question).

If you stick to numeric numpy arrays, those are freed, but boxed objects are not.

>>> import os, psutil, numpy as np # psutil may need to be installed
>>> def usage():
...     process = psutil.Process(os.getpid())
...     return process.memory_info()[0] / float(2 ** 20)
... 
>>> usage() # initial memory usage
27.5 

>>> arr = np.arange(10 ** 8) # create a large array without boxing
>>> usage()
790.46875
>>> del arr
>>> usage()
27.52734375 # numpy just free()'d the array

>>> arr = np.arange(10 ** 8, dtype='O') # create lots of objects
>>> usage()
3135.109375
>>> del arr
>>> usage()
2372.16796875  # numpy frees the array, but python keeps the heap big

Reducing the Number of Dataframes

Python keeps our memory at a high watermark, but we can reduce the total number of dataframes we create. When modifying your dataframe, prefer inplace=True so that you don't create copies.
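For example, a minimal sketch with a made-up dataframe (fillna/drop here just stand in for whatever transformations you actually run):

>>> import pandas as pd
>>> df = pd.DataFrame({'foo': [1.0, None, 3.0], 'bar': [4, 5, 6]})
>>> df.fillna(0, inplace=True)             # modifies df itself instead of returning a copy
>>> df.drop('bar', axis=1, inplace=True)   # instead of df = df.drop('bar', axis=1)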

Another common gotcha is holding on to copies of previously created dataframes in ipython:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'foo': [1,2,3,4]})

In [3]: df + 1
Out[3]: 
   foo
0    2
1    3
2    4
3    5

In [4]: df + 2
Out[4]: 
   foo
0    3
1    4
2    5
3    6

In [5]: Out # Still has all our temporary DataFrame objects!
Out[5]: 
{3:    foo
 0    2
 1    3
 2    4
 3    5, 4:    foo
 0    3
 1    4
 2    5
 3    6}

You can fix this by typing %reset Out to clear your history. Alternatively, you can adjust how much history ipython keeps with ipython --cache-size=5 (default is 1000).
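Continuing the session above, a quick sketch of what clearing the output cache does (exact prompts and behaviour vary by IPython version):

In [6]: len(Out)         # Out still caches the results of In[3], In[4] and In[5]
Out[6]: 3

In [7]: %reset -f out    # flush the output cache (-f skips the confirmation prompt)

In [8]: len(Out)
Out[8]: 0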

Reducing Dataframe Size

Wherever possible, avoid using object dtypes.

>>> df.dtypes
foo    float64 # 8 bytes per value
bar      int64 # 8 bytes per value
baz     object # at least 48 bytes per value, often more

Values with an object dtype are boxed, which means the numpy array just contains a pointer and you have a full Python object on the heap for every value in your dataframe. This includes strings.
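If a column is object only because numbers were read in as strings, converting it back to a numeric dtype removes the boxing (a sketch, assuming a pandas recent enough to have pd.to_numeric):

>>> import pandas as pd
>>> df = pd.DataFrame({'baz': ['1.5', '2.0', '3.25']})   # numeric data stored as strings -> object dtype
>>> df['baz'] = pd.to_numeric(df['baz'])                 # back to float64, 8 bytes per value
>>> df.dtypes
baz    float64
dtype: object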

Whilst numpy supports fixed-size strings in arrays, pandas does not (it has caused user confusion). This can make a significant difference:

>>> import numpy as np
>>> arr = np.array(['foo', 'bar', 'baz'])
>>> arr.dtype
dtype('S3')
>>> arr.nbytes
9

>>> import sys; import pandas as pd
>>> s = pd.Series(['foo', 'bar', 'baz'])
>>> s.dtype
dtype('O')
>>> sum(sys.getsizeof(x) for x in s)
120

You may want to avoid using string columns, or find a way of representing string data as numbers.
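One way to do the latter (a sketch, not from the original answer) is to map repeated strings to small integer codes with pandas.factorize and keep a single lookup table:

>>> import pandas as pd
>>> s = pd.Series(['apple', 'banana', 'apple', 'cherry', 'banana'])
>>> codes, uniques = pd.factorize(s)       # codes: one small int per row; uniques: each distinct string once
>>> df2 = pd.DataFrame({'fruit': codes})   # store the integer column instead of the strings
>>> original = uniques.take(codes)         # recovers the original strings when needed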

If you have a dataframe that contains many repeated values (NaN is very common), then you can use a sparse data structure to reduce memory usage:

>>> df1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 39681584 entries, 0 to 39681583
Data columns (total 1 columns):
foo    float64
dtypes: float64(1)
memory usage: 605.5 MB

>>> df1.shape
(39681584, 1)

>>> df1.foo.isnull().sum() * 100. / len(df1)
20.628483479893344 # so 20% of values are NaN

>>> df1.to_sparse().info()
<class 'pandas.sparse.frame.SparseDataFrame'>
Int64Index: 39681584 entries, 0 to 39681583
Data columns (total 1 columns):
foo    float64
dtypes: float64(1)
memory usage: 543.0 MB

Viewing Memory Usage

You can view the memory usage (docs):

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 39681584 entries, 0 to 39681583
Data columns (total 14 columns):
...
dtypes: datetime64[ns](1), float64(8), int64(1), object(4)
memory usage: 4.4+ GB

As of pandas 0.17.1, you can also do df.info(memory_usage='deep') to see memory usage including objects.
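The difference can be large for object columns. A small self-contained check (made-up data; the exact numbers vary by platform, so output is omitted):

>>> import pandas as pd
>>> df = pd.DataFrame({'num': range(1000), 'txt': ['a reasonably long string'] * 1000})
>>> df.memory_usage(deep=False)     # counts only the 8-byte pointers for 'txt'
>>> df.memory_usage(deep=True)      # also counts the Python string objects themselves
>>> df.info(memory_usage='deep')    # same deep accounting in the info() summary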
