pandas memory usage when reindexing


Question

I wonder why pandas uses so much memory when reindexing a Series.

I create a simple dataset:

a = pd.Series(np.arange(5e7, dtype=np.double))

According to top on my Ubuntu machine, the whole session uses about 820 MB.

Now if I slice this to extract the first 100 elements:

a_sliced = a[:100]

This does not increase memory consumption.

If I instead reindex a over the same range:

a_reindexed = a.reindex(np.arange(100))

Memory consumption rises to about 1.8 GB. I also tried cleaning up with gc.collect(), without success.

I would like to know whether this is expected, and whether there is a workaround for reindexing large datasets without significant memory overhead.

I am using a very recent snapshot of pandas from GitHub.

Recommended answer

An Index uses a hash table to map labels to locations. You can inspect it through Series.index._engine.mapping. The mapping is created only when it is needed, and reindex() needs it; for an index with 5e7 entries that table is large. If the index is monotonic (is_monotonic), you can use asof() instead:

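import numpy as np
import pandas as pd

# Two million zero-padded string labels, in increasing order
idx = ["%07d" % x for x in range(int(2e6))]
a = pd.Series(np.arange(2e6, dtype=np.double), index=idx)
new_index = ["0000003", "0000020", "000002a"]

# _engine is a private attribute and may differ between pandas versions.
# reindex() builds the label-to-location hash table on first use.
print(a.index._engine.mapping)  # None
print(a.reindex(new_index))
print(a.index._engine.mapping)  # <pandas.hashtable.PyObjectHashTable object at ...>

# On a fresh Series, asof() leaves the hash table unbuilt.
a = pd.Series(np.arange(2e6, dtype=np.double), index=idx)
print(a.asof(new_index))
print(a.index._engine.mapping)  # None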
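
Note that asof() and reindex() do not return the same thing when a label is missing. The small sketch below illustrates the difference; the Series s and the list wanted are made up for the example. reindex() matches labels exactly and fills missing ones with NaN, while asof() returns the last value whose label is at or before the requested one.

```python
import numpy as np
import pandas as pd

# Illustrative data (not from the question): a small Series with a sorted index.
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=[0, 10, 20, 30])
wanted = [10, 15, 30]  # 15 is not a label of s

exact = s.reindex(wanted)  # exact label match; NaN where the label is absent
nearest = s.asof(wanted)   # last value whose label is <= the requested label

print(exact)
print(nearest)
```

So asof() is a drop-in replacement for reindex() only when every requested label exists, or when "last known value" is the behaviour you want for missing labels.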

If you want more control over labels that do not exist, you can use searchsorted() and implement the logic yourself:

>>> a.index[a.index.searchsorted(new_index)] 
Index([u'0000003', u'0000020', u'0000030'], dtype=object)
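
As the output above shows, searchsorted() returns the insertion point, so the missing label '000002a' resolves to its successor '0000030'. One possible way to "do the logic yourself" is sketched below. The helper lookup_sorted is hypothetical (it is not part of pandas), it assumes a non-empty, sorted, unique index, and it uses np.searchsorted on the index values, which performs the same kind of binary search. Labels that are not found come back as NaN, as they would with reindex().

```python
import numpy as np
import pandas as pd

def lookup_sorted(series, labels):
    """Hypothetical helper: exact-match lookup via binary search.

    Assumes series.index is non-empty, sorted and unique.
    Labels absent from the index yield NaN.
    """
    index_values = series.index.to_numpy()
    labels_arr = np.asarray(labels, dtype=index_values.dtype)
    # Insertion point of each label in the sorted index.
    pos = np.searchsorted(index_values, labels_arr)
    # Labels past the last entry give pos == len(index); clip to stay in bounds.
    pos = np.minimum(pos, len(index_values) - 1)
    # A label is present only if the entry at its insertion point equals it.
    found = index_values[pos] == labels_arr
    result = np.full(len(labels_arr), np.nan)
    result[found] = series.to_numpy()[pos[found]]
    return pd.Series(result, index=list(labels))

# Usage on a smaller version of the data from the answer.
idx = ["%07d" % x for x in range(1000)]
a_small = pd.Series(np.arange(1000, dtype=np.double), index=idx)
print(lookup_sorted(a_small, ["0000003", "0000020", "000002a"]))
```

The comparison step is where the policy for missing labels lives; replacing NaN with the predecessor or successor value only requires changing how pos is used when found is False.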
