pandas memory usage when reindexing
Question
I wonder why pandas has such large memory usage when reindexing a Series.
I create a simple dataset:
a = pd.Series(np.arange(5e7, dtype=np.double))
According to top on my Ubuntu machine, the whole session is about 820 MB.
Now if I slice this to extract the first 100 elements:
a_sliced = a[:100]
This does not increase the memory consumption. Instead, if I reindex a on the same range:
a_reindexed = a.reindex(np.arange(100))
I get a memory consumption of about 1.8 GB. I also tried to clean up with gc.collect, without success.
I would like to know whether this is expected, and whether there is a workaround to reindex large datasets without significant memory overhead.
I am using a very recent snapshot of pandas from GitHub.
Answer
The index uses a hash table to map labels to locations. You can inspect it via Series.index._engine.mapping. This mapping is created lazily, only when it is needed. If the index is_monotonic, you can use asof() instead:
import numpy as np
import pandas as pd

idx = ["%07d" % x for x in range(int(2e6))]
a = pd.Series(np.arange(2e6, dtype=np.double), index=idx)
new_index = ["0000003", "0000020", "000002a"]

print(a.index._engine.mapping)  # None: the hash table has not been built yet
print(a.reindex(new_index))
print(a.index._engine.mapping)  # <pandas.hashtable.PyObjectHashTable object at ...>

a = pd.Series(np.arange(2e6, dtype=np.double), index=idx)
print(a.asof(new_index))
print(a.index._engine.mapping)  # None: asof() did not need the hash table
If you want more control over non-existent labels, you can use searchsorted() and implement the logic yourself:
>>> a.index[a.index.searchsorted(new_index)]
Index([u'0000003', u'0000020', u'0000030'], dtype=object)
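For instance, to get reindex-like behaviour (NaN for labels that do not exist) without triggering the hash-table build, one could combine searchsorted with an equality check on the found positions. This is only a sketch, assuming the index is sorted and no requested label sorts past the end of it:

```python
import numpy as np
import pandas as pd

idx = ["%07d" % x for x in range(int(2e6))]
a = pd.Series(np.arange(2e6, dtype=np.double), index=idx)
new_index = ["0000003", "0000020", "000002a"]

# Locate each requested label by binary search (requires a sorted index).
pos = a.index.searchsorted(new_index)
# A label exists only if the index value at the found position matches it.
found = a.index[pos] == new_index
# Emulate reindex: take the value where the label exists, NaN otherwise.
result = pd.Series(np.where(found, a.values[pos], np.nan), index=new_index)
print(result)
```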