How do I release memory used by a pandas dataframe?

Views: 299  Published: 2020/5/8 18:56:41  Tags: python pandas memory

Problem description

I have a really large csv file that I opened in pandas as follows:

import pandas
df = pandas.read_csv('large_txt_file.txt')

Once I do this, my memory usage increases by 2GB, which is expected because this file contains millions of rows. My problem comes when I need to release this memory. I ran:

del df

However, my memory usage did not drop. Is this the wrong approach to release memory used by a pandas data frame? If it is, what is the proper way?

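Recommended answer

Reducing the Number of Dataframes

Python keeps the process's memory at its high watermark: memory freed by del is often held for reuse rather than returned to the operating system. What we can do is reduce the total number of dataframes we create. When modifying your dataframe, prefer inplace=True where it is supported, so you don't keep an extra copy bound to a name (note that many pandas operations still build a temporary copy internally).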
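To separate "the dataframe object is gone" from "the process gave memory back", here is a minimal sketch that assumes CPython. It uses a weak reference to check that `del` really destroys the dataframe once no other name refers to it. The variable names (`alive`, `total`) are illustrative only. What it cannot show is the process footprint, which may stay near its peak even after the object is freed.

```python
import gc
import weakref

import numpy as np
import pandas as pd

# A throwaway frame standing in for the large CSV from the question.
df = pd.DataFrame({"foo": np.arange(1_000_000, dtype="float64")})

# A weak reference does not keep the frame alive; it only lets us observe it.
alive = weakref.ref(df)

# Keep only the small result we need, not the frame itself.
total = float(df["foo"].sum())

# Drop the last strong reference, then collect any leftover reference cycles.
del df
gc.collect()

print("frame destroyed:", alive() is None)
```

If the check reports that the frame is still alive, some other reference (a second variable, a list, an interactive output cache) is still holding it.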

Another common gotcha is holding on to copies of previously created dataframes in IPython:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'foo': [1,2,3,4]})

In [3]: df + 1
Out[3]: 
   foo
0    2
1    3
2    4
3    5

In [4]: df + 2
Out[4]: 
   foo
0    3
1    4
2    5
3    6

In [5]: Out # Still has all our temporary DataFrame objects!
Out[5]: 
{3:    foo
 0    2
 1    3
 2    4
 3    5, 4:    foo
 0    3
 1    4
 2    5
 3    6}

You can fix this by typing %reset Out to clear your output history. Alternatively, you can adjust how much history IPython keeps with ipython --cache-size=5 (default is 1000).

Wherever possible, avoid using object dtypes.

>>> df.dtypes
foo    float64 # 8 bytes per value
bar      int64 # 8 bytes per value
baz     object # at least 48 bytes per value, often more

Values with an object dtype are boxed, which means the numpy array just contains a pointer and you have a full Python object on the heap for every value in your dataframe. This includes strings.

Whilst numpy supports fixed-size strings in arrays, pandas does not (fixed-size strings have caused user confusion). This can make a significant difference:

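>>> # Output recorded on Python 2, where str is a byte string.
>>> # On Python 3 the array dtype is '<U3' and the sizes differ.
>>> import numpy as np
>>> arr = np.array(['foo', 'bar', 'baz'])
>>> arr.dtype
dtype('S3')
>>> arr.nbytes
9

>>> import sys; import pandas as pd
>>> s = pd.Series(['foo', 'bar', 'baz'])
>>> s.dtype
dtype('O')
>>> sum(sys.getsizeof(x) for x in s)
120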

You may want to avoid using string columns, or find a way of representing string data as numbers.
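One way to represent repeated strings as numbers is the pandas category dtype, which stores each distinct string once and keeps a small integer code per row. The answer does not name this technique, so treat the following as a sketch of one option. The column contents and variable names are made up for illustration.

```python
import pandas as pd

# 30,000 rows but only three distinct strings, stored as boxed Python objects.
colors = pd.Series(["red", "green", "blue"] * 10_000, dtype=object)

# Categorical storage: one small integer code per row plus the three categories.
as_category = colors.astype("category")

# deep=True counts the Python string objects, not just the 8-byte pointers.
object_bytes = colors.memory_usage(index=False, deep=True)
category_bytes = as_category.memory_usage(index=False, deep=True)

print("object dtype bytes:  ", object_bytes)
print("category dtype bytes:", category_bytes)
print("code dtype:", as_category.cat.codes.dtype)
```

The saving depends on how often values repeat. A column where nearly every string is unique gains little, because every distinct string still has to be stored once.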

If you have a dataframe that contains many repeated values (NaN is very common), then you can use a sparse data structure to reduce memory usage:

>>> df1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 39681584 entries, 0 to 39681583
Data columns (total 1 columns):
foo    float64
dtypes: float64(1)
memory usage: 605.5 MB

>>> df1.shape
(39681584, 1)

>>> df1.foo.isnull().sum() * 100. / len(df1)
20.628483479893344 # so 20% of values are NaN

>>> df1.to_sparse().info()
<class 'pandas.sparse.frame.SparseDataFrame'>
Int64Index: 39681584 entries, 0 to 39681583
Data columns (total 1 columns):
foo    float64
dtypes: float64(1)
memory usage: 543.0 MB
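The session above comes from an older pandas release. `to_sparse()` and `SparseDataFrame` were removed in pandas 1.0, and the replacement is to cast a column to `pd.SparseDtype`. The sketch below uses synthetic data that is about 90% NaN, an assumption chosen so the saving is visible. With the default integer index, each stored value also costs an index entry, so a column with few NaN values may not shrink at all.

```python
import numpy as np
import pandas as pd

# Synthetic column: roughly 90% of the values are NaN (illustrative data only).
rng = np.random.default_rng(0)
values = rng.random(100_000)
values[rng.random(100_000) < 0.9] = np.nan
dense = pd.Series(values, name="foo")

# Store only the non-NaN values plus their positions; NaN is the fill value.
sparse = dense.astype(pd.SparseDtype("float64", np.nan))

dense_bytes = dense.memory_usage(index=False)
sparse_bytes = sparse.memory_usage(index=False)

print("dense bytes: ", dense_bytes)
print("sparse bytes:", sparse_bytes)
print("density (share of stored values):", sparse.sparse.density)
```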

Viewing Memory Usage

You can view the memory usage (docs):

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 39681584 entries, 0 to 39681583
Data columns (total 14 columns):
...
dtypes: datetime64[ns](1), float64(8), int64(1), object(4)
memory usage: 4.4+ GB

As of pandas 0.17.1, you can also do df.info(memory_usage='deep') to see memory usage including objects. Without it, the "+" in the figure above means that the Python objects behind object columns are not counted.
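For a per-column breakdown, `DataFrame.memory_usage` accepts the same `deep` option. The following small sketch uses made-up column names and values to show how the shallow figure for an object column counts only the pointers, while the deep figure also counts the Python objects they point to.

```python
import pandas as pd

df = pd.DataFrame({
    "foo": [1.0, 2.0, 3.0],
    # Force object dtype so the strings are boxed Python objects.
    "baz": pd.Series(["alpha", "beta", "gamma"], dtype=object),
})

# Shallow: size of the underlying arrays only (8-byte pointers for object columns).
shallow = df.memory_usage(index=False)

# Deep: also adds the size of each Python object referenced by object columns.
deep = df.memory_usage(index=False, deep=True)

print(shallow)
print(deep)
```

Numeric columns report the same figure either way, so a large gap between the two totals points at object columns as the place to look for savings.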
