Memory efficient way to store bool and NaN values in pandas
Question
I am working with quite a large dataset (over 4 GB), which I imported into pandas. Quite a few columns in this dataset are simple True/False indicators, and naturally the most memory-efficient way to store them would be a bool dtype. However, these columns also contain some NaN values that I want to preserve. Right now this leaves each column with dtype float (with values 1.0, 0.0 and np.nan) or object, and both use far too much memory.
For example:
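import numpy as np
import pandas as pd

# Three identical columns holding True, False and NaN (object dtype by default)
df = pd.DataFrame([[True, True, True],
                   [False, False, False],
                   [np.nan, np.nan, np.nan]])
df[1] = df[1].astype(bool)   # NaN is truthy, so it silently becomes True
df[2] = df[2].astype(float)  # True/False/NaN -> 1.0/0.0/NaN
print(df)
print(df.memory_usage(index=False, deep=True))
print(df.memory_usage(index=False, deep=False))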
This produces:
       0      1    2
0   True   True  1.0
1  False  False  0.0
2    NaN   True  NaN
0    100
1      3
2     24
dtype: int64
0    24
1     3
2    24
dtype: int64
What would be the most efficient way to store these kinds of values, knowing they can only take on 3 different kinds of values: True, False and <undefined>?
Recommended answer
Use dtype: int8
1 = True
0 = False
-1 = NaN
An int8 value takes 1 byte, so this is 4 times smaller than float32 and 8 times smaller than float64.
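A minimal sketch of the int8 encoding described in this answer, assuming the column starts out as an object column of True/False/NaN as in the question. The helper names `encode_tristate` and `decode_tristate` and the `MISSING` constant are hypothetical, introduced only for illustration; they are not part of pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical sentinel standing in for NaN, following the mapping above.
MISSING = -1

def encode_tristate(col: pd.Series) -> pd.Series:
    """Map True -> 1, False -> 0, NaN -> -1 and store the result as int8."""
    as_float = col.astype(float)  # True/False/NaN -> 1.0/0.0/NaN
    return as_float.fillna(MISSING).astype(np.int8)

def decode_tristate(col: pd.Series) -> pd.Series:
    """Inverse mapping: back to float, with NaN wherever the sentinel was."""
    return col.astype(float).where(col != MISSING)

raw = pd.Series([True, False, np.nan])  # object dtype, as in the question
packed = encode_tristate(raw)

print(packed.tolist())                              # [1, 0, -1]
print(packed.memory_usage(index=False))             # 3 bytes (1 per value)
print(raw.astype(float).memory_usage(index=False))  # 24 bytes (8 per value)
```

One thing to keep in mind with this approach: the sentinel is an ordinary integer, so aggregations such as `sum()` or `mean()` on the packed column will count -1 as a value unless the column is decoded or filtered first.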