Python: reducing memory usage of small integers with missing values


Question

I am in the process of reducing the memory usage of my code. The goal of this code is to handle some big datasets, which are stored in Pandas DataFrames if that is relevant.

Among many other data there are some small integers. Because they contain some missing values (NA), Python has them set to the float64 type by default. I tried to downcast them to a smaller int format (int8 or int16, for example), but I got an error because of the NAs.
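For context, a minimal sketch of the failure described above (the values are illustrative): a float64 column containing NaN cannot be cast to a plain NumPy integer dtype.

import pandas as pd
import numpy as np

# A small-integer column that pandas stores as float64 because of the missing value
s = pd.Series([4, np.nan, 3, 1])   # dtype: float64

try:
    s.astype('int8')               # plain NumPy int dtypes cannot represent NaN
except ValueError as err:
    print(err)                     # e.g. "Cannot convert non-finite values (NA or inf) to integer"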

It seems that there is a new integer type (Int64) that can handle missing values, but it wouldn't help with memory usage. I gave some thought to using a category, but I am not sure this would not create a bottleneck further down the pipeline. Downcasting float64 to float32 seems to be my main option for reducing memory usage (rounding errors do not really matter for my use case).
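To make the trade-off concrete, here is a rough sketch (synthetic data, approximate sizes) of the per-column footprint of the options mentioned above:

import pandas as pd
import numpy as np

# One million small integers with some missing values, stored as float64 by default
s64 = pd.Series(np.random.choice([1.0, 2.0, 3.0, np.nan], size=1_000_000))

print(s64.memory_usage(deep=True))                     # ~8 MB of data (float64)
print(s64.astype('float32').memory_usage(deep=True))   # ~4 MB (float32)
print(s64.astype('category').memory_usage(deep=True))  # ~1 MB of int8 codes plus the categories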

Do I have a better option to reduce the memory consumption of handling small integers with missing values?

Answer

The new (Pandas v1.0+) "Integer Array" data types do allow significant memory savings. Missing values are recognized by Pandas .isnull() and are also compatible with the PyArrow Feather format, which is disk-efficient for writing data. Feather requires a consistent data type per column. See the Pandas documentation here. Here is an example. Note the capital 'I' in the Pandas-specific Int16 data type.

import pandas as pd
import numpy as np

dftemp = pd.DataFrame({'dt_col': ['1/1/2020', np.nan, '1/3/2020', '1/4/2020'],
                       'int_col': [4, np.nan, 3, 1],
                       'float_col': [0.0, 1.0, np.nan, 4.5],
                       'bool_col': [True, False, False, True],
                       'text_col': ['a', 'b', None, 'd']})

#Write to CSV (to be read back in to fully simulate CSV behavior with missing values etc.)
dftemp.to_csv('MixedTypes.csv', index=False)

lst_cols = ['int_col','float_col','bool_col','text_col']
lst_dtypes = ['Int16','float','bool','object']
dict_types = dict(zip(lst_cols,lst_dtypes))

# Unoptimized DataFrame
df = pd.read_csv('MixedTypes.csv')
df

Result:

     dt_col  int_col  float_col  bool_col text_col
0  1/1/2020      4.0        0.0      True        a
1       NaN      NaN        1.0     False        b
2  1/3/2020      3.0        NaN     False      NaN
3  1/4/2020      1.0        4.5      True        d

Check memory usage (with special focus on int_col):

df.memory_usage()

Result:

Index        128
dt_col        32
int_col       32
float_col     32
bool_col       4
text_col      32
dtype: int64

Repeat with explicit assignment of variable types, including Int16 for int_col:

df2 = pd.read_csv('MixedTypes.csv', dtype=dict_types, parse_dates=['dt_col'])
print(df2)

      dt_col  int_col  float_col  bool_col text_col
0 2020-01-01        4        0.0      True        a
1        NaT     <NA>        1.0     False        b
2 2020-01-03        3        NaN     False      NaN
3 2020-01-04        1        4.5      True        d

df2.memory_usage()

In larger-scale data, this results in significant memory and disk-space efficiency, in my experience:

Index        128
dt_col        32
int_col       12
float_col     32
bool_col       4
text_col      32
dtype: int64
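The same nullable dtype can also be applied to data that is already in memory as float64, as in the question. Continuing from the unoptimized df above, a short sketch (the values must be whole numbers, the Feather file name is illustrative, and to_feather needs pyarrow installed):

# Convert the existing float64 column in place; the missing entry becomes <NA>
df['int_col'] = df['int_col'].astype('Int16')
print(df.memory_usage())   # int_col drops from 32 bytes to 12 bytes here

# The answer notes Feather compatibility; the nullable dtype should survive a round trip
# because pandas stores its dtype metadata in the file
df.to_feather('MixedTypes.feather')
print(pd.read_feather('MixedTypes.feather').dtypes)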

