Pandas DataFrame - Aggregate on column whose dtype=='category' leads to slow performance

Problem description

I work with big DataFrames with high memory usage, and I read that changing the dtype of columns with repeated values can save a large amount of memory.

I tried it, and it did drop memory usage by 25%, but then I ran into a performance slowdown that I could not understand.

I do a group-by aggregation on the 'category'-dtype columns; before I changed the dtype it took about 1 second, and after the change it took about 1 minute.

This code demonstrates the performance degradation by a factor of 2:

import pandas as pd
import random

animals = ['Dog', 'Cat']
days = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']

# Build two columns of one million random labels each
columns_dict = {'animals': [],
                'days': []}

for i in range(1000000):
    columns_dict['animals'].append(random.choice(animals))
    columns_dict['days'].append(random.choice(days))

# df without 'category' dtype
df = pd.DataFrame(columns_dict)

df.info(memory_usage='deep') # will result in memory usage of 95.5 MB

%timeit -n100 df.groupby('days').agg({'animals': 'first'})
# will result in: 100 loops, best of 3: 54.2 ms per loop

# df with 'category' dtype
df2 = df.copy()
df2['animals'] = df2['animals'].astype('category')

df2.info(memory_usage='deep') # will result in memory usage of 50.7 MB

%timeit -n100 df2.groupby('days').agg({'animals': 'first'})
# will result in: 100 loops, best of 3: 111 ms per loop

What I am trying to understand is the cause of this slowness, and whether there is a way to overcome it.

Thanks!

Answer

I'm not certain where this slowdown is coming from, but one workaround is to store the category codes directly:

df3 = df.copy()
animals = pd.Categorical(df['animals'])

# Store the integer category codes instead of the categorical column itself
df3['animals'] = animals.codes

# Aggregate on the plain integer column, then map the codes back to their labels
df3.groupby('days').agg({'animals': 'first'}).apply(lambda code: animals.categories[code])
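
As a quick sanity check (a minimal sketch reusing the animals and df3 names from the snippet above), the decoded result can also be produced without DataFrame.apply, by taking the aggregated codes straight from the categories index:

result = df3.groupby('days')['animals'].first()   # Series of integer codes
decoded = pd.Series(animals.categories.take(result), index=result.index)
print(decoded.head())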

It's not the cleanest solution, because it requires external metadata, but it achieves both the memory efficiency and the computational speed you're looking for. It would be interesting to dig into what Pandas is doing internally that causes this slowdown for categoricals.

I tracked down why this happens... as part of the first() aggregation, pandas calls np.asarray() on the column. In the case of a categorical column, this ends up converting the column back to non-categoricals, leading to unnecessary overhead. Fixing this would be a useful contribution to the pandas package!
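
To see that densification in isolation (a small illustrative sketch, not the exact call path inside pandas), watch what np.asarray() does to a categorical column:

import numpy as np
import pandas as pd

s = pd.Series(['Dog', 'Cat'] * 500000, dtype='category')
print(s.memory_usage(deep=True))  # compact: int8 codes plus two category labels

arr = np.asarray(s)               # materializes the string labels again
print(arr.dtype)                  # object -- the categorical compression is gone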
