Pandas pivot table int32 overflow

Problem description

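I'm trying to create a pivot table with the pandas.DataFrame.pivot method and am running into problems.

My DataFrame has shape (1300000, 6), so the resulting pivot table is likely to be very large. The specific error I get is: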

ValueError: Unstacked DataFrame is too big, causing int32 overflow

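One (failed) approach I came up with was to split the DataFrame into smaller DataFrames, create a pivot table from each, and then concatenate those pivot tables into the large pivot table I originally wanted.

The problem is that the pivot tables have a different shape for each slice: I'm calling df.pivot(index='col1', columns='col2')['col3'], and the set of values in each column differs from slice to slice. For reference, the pivot table built from the first 100 rows has shape (62, 63), whereas building one from the first half of the original DataFrame gives me this error: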

*** MemoryError: Unable to allocate array with shape (13218, 389275) and data type object

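Does anybody have an idea how I could work around this problem?

In case anyone is wondering why I insist on creating a pivot table: I'm working with a team whose entire code base assumes there is a pivot table to work with.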
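
For a sense of scale, the shape reported in the MemoryError above can be turned into a cell count and a rough memory estimate. This is only back-of-the-envelope arithmetic: it assumes a 64-bit Python build, where an object-dtype array stores one 8-byte pointer per cell, and it ignores the memory used by the Python objects themselves.

```python
# Shape taken from the MemoryError in the question (first half of the data).
rows, cols = 13218, 389275
cells = rows * cols

INT32_MAX = 2**31 - 1   # largest value a signed 32-bit integer can hold
POINTER_BYTES = 8       # assumption: 64-bit build, one pointer per object cell

print(f"cells: {cells:,}")
print(f"exceeds int32 max: {cells > INT32_MAX}")
print(f"approx. array size: {cells * POINTER_BYTES / 1e9:.1f} GB")
```

Under these assumptions the half-sized table already needs about 5.1 billion cells, more than a signed 32-bit integer can count and roughly 41 GB for the pointer array alone. The pivot of the full DataFrame has at least as many cells.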

Solution

Try reading your data source in chunks.

For example, read_csv has a chunksize parameter.

Details are in the pandas documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
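
A minimal sketch of what chunked reading could look like, assuming a CSV source with the columns col1, col2 and col3 from the question. The inline data, the chunk size of 2 and the use of combine_first to align the per-chunk tables are illustrative choices, not part of the original answer.

```python
import io

import pandas as pd

# Hypothetical stand-in for a large CSV file on disk.
csv_data = io.StringIO(
    "col1,col2,col3\n"
    "a,x,1\n"
    "a,y,2\n"
    "b,x,3\n"
    "c,z,4\n"
)

pieces = []
# With chunksize set, read_csv yields DataFrames of at most that many rows.
for chunk in pd.read_csv(csv_data, chunksize=2):
    pieces.append(chunk.pivot(index="col1", columns="col2", values="col3"))

# Each piece has its own row and column labels; combine_first aligns them
# on the union of both and fills cells missing from the result so far.
result = pieces[0]
for piece in pieces[1:]:
    result = result.combine_first(piece)

print(result)
```

Two caveats apply to this sketch. If the same (col1, col2) pair appears in more than one chunk, combine_first silently keeps the first value, whereas pivot on the full data would raise an error for duplicates. Chunking also only limits how much is read at once: the combined table is still dense, so it needs memory for every cell of the final shape.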
