Can pandas SparseSeries store values in the float16 dtype?

Question

I want to use a smaller data type in the sparse pandas containers in order to reduce memory usage. This matters when working with data that originally uses bool (e.g. from get_dummies) or small numeric dtypes (e.g. int8), all of which are converted to float64 in sparse containers.

The example below uses a modest 20k x 145 dataframe. In practice I'm working with dataframes on the order of 1e6 x 5e3.

In []: bool_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19849 entries, 0 to 19848
Columns: 145 entries, topic.party_nl.p.pvda to topic.sub_cat_Reizen
dtypes: bool(145)
memory usage: 2.7 MB

In []: bool_df.memory_usage(index=False).sum()
Out[]: 2878105

In []: bool_df.values.itemsize
Out[]: 1

A sparse version of this dataframe needs less memory, but is still much larger than needed, given the original dtype.

In []: sparse_df = bool_df.to_sparse(fill_value=False)

In []: sparse_df.info()
<class 'pandas.sparse.frame.SparseDataFrame'>
RangeIndex: 19849 entries, 0 to 19848
Columns: 145 entries, topic.party_nl.p.pvda to topic.sub_cat_Reizen
dtypes: float64(145)
memory usage: 1.1 MB

In []: sparse_df.memory_usage(index=False).sum()
Out[]: 1143456

In []: sparse_df.values.itemsize
Out[]: 8

Even though this data is fairly sparse, the dtype conversion from bool to float64 causes non-fill values to take up 8x more space.

In []: sparse_df.memory_usage(index=False).describe()
Out[]:
count      145.000000
mean      7885.903448
std      17343.762402
min          8.000000
25%        640.000000
50%       1888.000000
75%       4440.000000
max      84688.000000

Given the sparsity of the data, one would hope for a more drastic reduction in memory size:

In []: sparse_df.density
Out[]: 0.04966184346992205

Memory footprint of underlying storage

The columns of SparseDataFrame are SparseSeries, which use SparseArray as a wrapper for the underlying numpy.ndarray storage. The number of bytes that are used by the sparse dataframe can (also) be computed directly from these ndarrays:

In []: col64_nbytes = [
.....:     sparse_df[col].values.sp_values.nbytes
.....:     for col in sparse_df
.....: ]

In []: sum(col64_nbytes)
Out[]: 1143456
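
To make the storage layout concrete, here is a minimal sketch of a single sparse array. It uses `pd.arrays.SparseArray`, the spelling in current pandas releases, rather than the `SparseSeries` API shown in this question, and the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): mostly the fill value 0.0, with two non-fill entries.
dense = np.array([0.0, 0.0, 1.0, 0.0, 2.0])
arr = pd.arrays.SparseArray(dense, fill_value=0.0)

print(arr.sp_values)          # only the non-fill values are kept in this ndarray
print(arr.sp_values.dtype)    # float64
print(arr.sp_values.nbytes)   # 16 (2 stored values x 8 bytes)
print(arr.npoints)            # 2
print(arr.density)            # 0.4
```

The size of `sp_values` scales with the number of non-fill values times the item size of its dtype, which is the quantity summed in the transcript above.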

The ndarrays can be converted to use smaller floats, which allows one to calculate how much memory the dataframe would need when using e.g. float16s. This would result in a 4x smaller dataframe, as one might expect.

In []: col16_nbytes = [
.....:     sparse_df[col].values.sp_values.astype('float16').nbytes
.....:     for col in sparse_df
.....: ]

In []: sum(col16_nbytes)
Out[]: 285864
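
Whether float16 is safe depends on the values being stored. The sketch below, in plain NumPy, uses a stand-in array of ones with the same number of non-fill values as the example (1143456 / 8 = 142932) and checks that the full int8 range survives a round trip through float16. The array contents are an assumption for illustration, not the actual data.

```python
import numpy as np

# Stand-in for the stored non-fill values (hypothetical contents: all ones).
n_stored = 1143456 // 8
sp_values = np.ones(n_stored, dtype="float64")
sp_values16 = sp_values.astype("float16")

print(sp_values.nbytes)     # 1143456
print(sp_values16.nbytes)   # 285864

# Every int8 value (and therefore 0/1 flags) is exactly representable in float16.
int8_range = np.arange(-128, 128).astype("int8")
round_trip = int8_range.astype("float16").astype("int8")
print(np.array_equal(int8_range, round_trip))   # True

# float16 has an 11-bit significand, so larger integers start to lose precision.
print(float(np.float16(2049)))   # 2048.0
```

For data whose values fall outside the exactly representable range, float32 would be the more conservative choice, at twice the size of float16.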

By using the more appropriate dtype, the memory usage can be reduced to 10% of the dense version, whereas the float64 sparse dataframe reduces to 40%. For my data, this could make the difference between needing 20 GB and 5 GB of available memory.

In []: sum(col64_nbytes) / bool_df.memory_usage(index=False).sum()
Out[]: 0.3972947477593764

In []: sum(col16_nbytes) / bool_df.memory_usage(index=False).sum()
Out[]: 0.0993236869398441

Issue

Unfortunately, dtype conversion of sparse containers has not been implemented in pandas:

In []: sparse_df.astype('float16')
---------------------------------------------------
[...]/pandas/sparse/frame.py in astype(self, dtype)
    245
    246     def astype(self, dtype):
--> 247         raise NotImplementedError
    248
    249     def copy(self, deep=True):

NotImplementedError:

How can the SparseSeries in a SparseDataFrame be converted to use the numpy.float16 data type, or another dtype that uses fewer than 64 bits per item, instead of the default numpy.float64?

Recommended answer

The SparseArray constructor can be used to convert its underlying ndarray's dtype. To convert all sparse series in a dataframe, one can iterate over the df's series, convert their arrays, and replace the series with converted versions.

import pandas as pd
import numpy as np

def convert_sparse_series_dtype(sparse_series, dtype):
    """Return a new SparseSeries holding the values of sparse_series as dtype."""
    dtype = np.dtype(dtype)
    if 'float' not in str(dtype):
        raise TypeError('Sparse containers only support float dtypes')

    # Rebuild the underlying SparseArray with the requested dtype.
    sparse_array = sparse_series.values
    converted_sp_array = pd.SparseArray(sparse_array, dtype=dtype)

    # Wrap the converted array in a new SparseSeries.
    converted_sp_series = pd.SparseSeries(converted_sp_array)
    return converted_sp_series


def convert_sparse_columns_dtype(sparse_dataframe, dtype):
    """Convert every SparseSeries column of sparse_dataframe in place."""
    for col_name in sparse_dataframe:
        if isinstance(sparse_dataframe[col_name], pd.SparseSeries):
            sparse_dataframe.loc[:, col_name] = convert_sparse_series_dtype(
                sparse_dataframe[col_name], dtype
            )

This achieves the stated purpose of reducing the sparse dataframe's memory footprint:

In []: sparse_df.info()
<class 'pandas.sparse.frame.SparseDataFrame'>
RangeIndex: 19849 entries, 0 to 19848
Columns: 145 entries, topic.party_nl.p.pvda to topic.sub_cat_Reizen
dtypes: float64(145)
memory usage: 1.1 MB

In []: convert_sparse_columns_dtype(sparse_df, 'float16')

In []: sparse_df.info()
<class 'pandas.sparse.frame.SparseDataFrame'>
RangeIndex: 19849 entries, 0 to 19848
Columns: 145 entries, topic.party_nl.p.pvda to topic.sub_cat_Reizen
dtypes: float16(145)
memory usage: 279.2 KB

In []: bool_df.equals(sparse_df.to_dense().astype('bool'))
Out[]: True

It is, however, a somewhat lousy solution, because the converted dataframe behaves unpredictably when it interacts with other dataframes. For instance, when converted sparse dataframes are concatenated with other dataframes, all contained series become dense series. This is not the case for unconverted sparse dataframes, whose series remain sparse in the resulting dataframe.
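
SparseSeries and SparseDataFrame were removed in pandas 1.0; sparse data is now held in ordinary Series and DataFrame objects whose columns have a `SparseDtype`. The following is a sketch of how the same memory comparison might look with that API, assuming a recent pandas version and using randomly generated stand-in data rather than the dataframe from the question. The helper `sp_values_nbytes` is hypothetical and only sums the stored non-fill values, so it ignores the per-column position index that pandas keeps alongside them.

```python
import numpy as np
import pandas as pd

# Stand-in data (hypothetical): 1000 x 5 boolean frame, roughly 5% True.
rng = np.random.default_rng(0)
bool_df = pd.DataFrame(rng.random((1000, 5)) < 0.05, columns=list("abcde"))


def sp_values_nbytes(df):
    """Sum the bytes used by the stored (non-fill) values of every sparse column."""
    return sum(df[col].array.sp_values.nbytes for col in df)


# Sparse columns that keep the original bool dtype (1 byte per stored value).
sparse_bool = bool_df.astype(pd.SparseDtype(bool, fill_value=False))

# Sparse float columns, built from a dense frame of the matching float dtype.
sparse_f64 = bool_df.astype("float64").astype(
    pd.SparseDtype("float64", fill_value=0.0)
)
sparse_f16 = bool_df.astype("float16").astype(
    pd.SparseDtype("float16", fill_value=0.0)
)

print(sp_values_nbytes(sparse_f64))
print(sp_values_nbytes(sparse_f16))
print(sp_values_nbytes(sparse_bool))

# Converting back to dense should recover the original frame.
restored = sparse_f16.sparse.to_dense().astype(bool)
print(bool_df.equals(restored))
```

In this sketch the float16 columns store a quarter of the bytes of the float64 columns, and the bool columns an eighth. How such columns behave under concatenation and arithmetic in a given pandas version is not covered here and would need to be checked separately.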
