pivot_table requires more memory if dtype is category (MemoryError)

Views: 85 | Published: 2020/5/24 3:34:47 | Tags: python, python-3.x, pandas, dataframe


Question

I am seeing the following strange behaviour with pandas (pandas==0.23.1):

import pandas as pd
df = pd.DataFrame({'t1': ["a","b","c"]*10000, 't2': ["x","y","z"]*10000, 'i1': list(range(5000))*6, 'i2': list(range(5000))*6, 'dummy':0})
# works fast with less memory
piv = df.pivot_table(values='dummy', index=['i1','i2'], columns=['t1','t2'])

d2 = df.copy()
d2.t1 = d2.t1.astype('category')
d2.t2 = d2.t2.astype('category')

# needs > 20GB of memory and takes forever
piv2 = d2.pivot_table(values='dummy', index=['i1','i2'], columns=['t1','t2'])

I am wondering whether this is expected and I am doing something wrong, or whether this is a bug in pandas. Shouldn't the category dtype for str columns be fairly transparent for this use case?

Recommended answer

This is not a bug. What is happening is that pandas.pivot_table computes the Cartesian product of the grouper categories.
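
To get a feel for the scale involved, the sketch below counts the key combinations that actually occur in the question's dataframe and compares that with the size of the full Cartesian product of the distinct key values. The names keys, observed and cartesian are introduced here for illustration only. The product is only a count of result cells; it is not a measurement of the memory pandas actually allocates.

```python
import pandas as pd

# The dataframe from the question.
df = pd.DataFrame({
    't1': ["a", "b", "c"] * 10000,
    't2': ["x", "y", "z"] * 10000,
    'i1': list(range(5000)) * 6,
    'i2': list(range(5000)) * 6,
    'dummy': 0,
})
keys = ['i1', 'i2', 't1', 't2']

# Number of key combinations that actually occur in the data.
observed = len(df[keys].drop_duplicates())

# Size of the full Cartesian product of the distinct values of each key.
cartesian = 1
for key in keys:
    cartesian *= df[key].nunique()

print(observed)   # 15000
print(cartesian)  # 225000000
```

The full product has 15,000 times as many cells as the observed combinations, which is consistent with the slowdown and memory pressure described in the question.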

This is known, intended behaviour. Pandas v0.23.0 introduced the observed argument for DataFrame.groupby. Setting observed=True includes only observed combinations; it is False by default. As of that version, the argument has not yet been rolled out to related methods such as pandas.pivot_table. In my opinion, it should be.
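
As a minimal sketch of what the observed argument does, the example below groups a tiny frame by two categorical columns, once with observed=False and once with observed=True. The frame and the names all_combos and seen_only are made up for illustration, and observed is passed explicitly both times so the result does not depend on the default of the installed pandas version.

```python
import pandas as pd

# Two categorical columns where only (a, x), (b, y) and (c, z) ever occur.
df = pd.DataFrame({
    't1': pd.Categorical(["a", "b", "c"] * 2),
    't2': pd.Categorical(["x", "y", "z"] * 2),
    'dummy': 0,
})

# observed=False: one row per combination of categories, seen or not.
all_combos = df.groupby(['t1', 't2'], observed=False)['dummy'].mean()

# observed=True: one row per combination that appears in the data.
seen_only = df.groupby(['t1', 't2'], observed=True)['dummy'].mean()

print(len(all_combos))  # 9
print(len(seen_only))   # 3
```

The six extra rows in all_combos are the unobserved combinations, and their mean is NaN.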

But now let's see what this means in practice. We can take an example dataframe and look at what gets printed.

We make the dataframe substantially smaller:

import pandas as pd

n = 10

df = pd.DataFrame({'t1': ["a","b","c"]*n, 't2': ["x","y","z"]*n,
                   'i1': list(range(int(n/2)))*6, 'i2': list(range(int(n/2)))*6,
                   'dummy':0})

Without categories

This is likely what you are looking for. Unobserved combinations of categories are not represented in your pivot table.

piv = df.pivot_table(values='dummy', index=['i1','i2'], columns=['t1','t2'])
print(piv)

t1     a  b  c
t2     x  y  z
i1 i2         
0  0   0  0  0
1  1   0  0  0
2  2   0  0  0
3  3   0  0  0
4  4   0  0  0

With categories

With categories, all combinations of categories, even unobserved ones, are accounted for in the result. This is expensive both computationally and in memory. Moreover, the dataframe is dominated by NaN from unobserved combinations. It's probably not what you want.

d2 = df.copy()
d2.t1 = d2.t1.astype('category')
d2.t2 = d2.t2.astype('category')

piv2 = d2.pivot_table(values='dummy', index=['i1','i2'], columns=['t1','t2'])
print(piv2)

t1       a           b            c         
t2       x   y   z   x    y   z   x   y    z
i1 i2                                       
0  0   0.0 NaN NaN NaN  0.0 NaN NaN NaN  0.0
   1   NaN NaN NaN NaN  NaN NaN NaN NaN  NaN
   2   NaN NaN NaN NaN  NaN NaN NaN NaN  NaN
   3   NaN NaN NaN NaN  NaN NaN NaN NaN  NaN
   4   NaN NaN NaN NaN  NaN NaN NaN NaN  NaN
1  0   NaN NaN NaN NaN  NaN NaN NaN NaN  NaN
   1   0.0 NaN NaN NaN  0.0 NaN NaN NaN  0.0
   2   NaN NaN NaN NaN  NaN NaN NaN NaN  NaN
   3   NaN NaN NaN NaN  NaN NaN NaN NaN  NaN
   4   NaN NaN NaN NaN  NaN NaN NaN NaN  NaN
2  0   NaN NaN NaN NaN  NaN NaN NaN NaN  NaN
   1   NaN NaN NaN NaN  NaN NaN NaN NaN  NaN
   2   0.0 NaN NaN NaN  0.0 NaN NaN NaN  0.0
   3   NaN NaN NaN NaN  NaN NaN NaN NaN  NaN
   4   NaN NaN NaN NaN  NaN NaN NaN NaN  NaN
3  0   NaN NaN NaN NaN  NaN NaN NaN NaN  NaN
   1   NaN NaN NaN NaN  NaN NaN NaN NaN  NaN
   2   NaN NaN NaN NaN  NaN NaN NaN NaN  NaN
   3   0.0 NaN NaN NaN  0.0 NaN NaN NaN  0.0
   4   NaN NaN NaN NaN  NaN NaN NaN NaN  NaN
4  0   NaN NaN NaN NaN  NaN NaN NaN NaN  NaN
   1   NaN NaN NaN NaN  NaN NaN NaN NaN  NaN
   2   NaN NaN NaN NaN  NaN NaN NaN NaN  NaN
   3   NaN NaN NaN NaN  NaN NaN NaN NaN  NaN
   4   0.0 NaN NaN NaN  0.0 NaN NaN NaN  0.0
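
Two possible ways around the blow-up are sketched below; neither is part of the original answer. The first casts the categorical columns back to plain strings before pivoting, which reproduces the "without categories" result. The second assumes a pandas release newer than the 0.23.1 used in the question, in which pivot_table itself accepts observed=True; check the documentation of your installed version before relying on it. The names plain, piv_plain and piv_observed are illustrative.

```python
import pandas as pd

n = 10
df = pd.DataFrame({
    't1': ["a", "b", "c"] * n,
    't2': ["x", "y", "z"] * n,
    'i1': list(range(n // 2)) * 6,
    'i2': list(range(n // 2)) * 6,
    'dummy': 0,
})
d2 = df.astype({'t1': 'category', 't2': 'category'})

# Option 1: cast the categorical keys back to strings before pivoting.
plain = d2.astype({'t1': str, 't2': str})
piv_plain = plain.pivot_table(values='dummy', index=['i1', 'i2'],
                              columns=['t1', 't2'])

# Option 2 (assumes a newer pandas): keep the category dtype and ask
# pivot_table to use only the observed combinations.
piv_observed = d2.pivot_table(values='dummy', index=['i1', 'i2'],
                              columns=['t1', 't2'], observed=True)

print(piv_plain.shape)     # (5, 3)
print(piv_observed.shape)  # (5, 3)
```

Both results have the 5 observed index pairs and the 3 observed column pairs, rather than the 25 by 9 grid printed above.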
