A Faster Way of Removing Unused Categories in Pandas?


Question

I'm running some models in Python, with data subset on categories.

For memory usage, and preprocessing, all the categorical variables are stored as category data type.

For each level of a categorical variable in my 'group by' column, I am running a regression, where I need to reset all my categorical variables to those that are present in that subset.

I am currently doing this using .cat.remove_unused_categories(), which is taking nearly 50% of my total runtime. At the moment, the worst offender is my grouping column, others are not taking as much time (as I guess there are not as many levels to drop).

Here is a simplified example:

import itertools
import pandas as pd
#generate some fake data
alphabets = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
keywords = [''.join(i) for i in itertools.product(alphabets, repeat = 2)]
z = pd.DataFrame({'x':keywords})

#convert to category datatype
z.x = z.x.astype('category')

#groupby
z = z.groupby('x')

#loop over groups
for i in z.groups:
    x = z.get_group(i)
    x.x = x.x.cat.remove_unused_categories()
    #run my fancy model here

On my laptop, this takes about 20 seconds. For this small example, we could convert to str and then back to category for a speed-up, but my real data has at least 300 rows per group.
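For reference, the str round trip mentioned above could look roughly like the following. This is a minimal sketch on a tiny made-up Series (the names s, subset and roundtrip are illustrative, not from my real code), and it assumes the column holds plain strings:

```python
import pandas as pd

# Tiny made-up categorical Series with three categories
s = pd.Series(["aa", "ab", "ac", "aa"], dtype="category")

# Filtering keeps all original categories, even the ones no longer present
subset = s[s == "aa"]

# Converting to str and back rebuilds the categories from the values that remain
roundtrip = subset.astype(str).astype("category")

print(list(subset.cat.categories))
print(list(roundtrip.cat.categories))
```

The round trip re-infers the categories from scratch, so any custom category order would be lost.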

Is it possible to speed up this loop? I have tried x.x = x.x.cat.set_categories(i), which takes a similar amount of time, and x.x.cat.categories = i, which complains that it needs the same number of categories as I started with.

Recommended answer

Your problem is that you are assigning z.get_group(i) to x. x is now a copy of a portion of z. Your code will work fine with this change:

for i in z.groups:
    x = z.get_group(i).copy() # will no longer be tied to z
    x.x = x.x.cat.remove_unused_categories()
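
For completeness, here is a self-contained sketch of the same idea that also checks the result. It assumes the fake data from the question; the names df and remaining are introduced only for illustration, and iterating over the groupby object directly (with observed=True) is a stylistic choice, not something the fix depends on.

```python
import itertools
import string

import pandas as pd

# Same fake data as in the question: all two-letter keywords as a categorical column
keywords = ["".join(p) for p in itertools.product(string.ascii_lowercase, repeat=2)]
df = pd.DataFrame({"x": pd.Categorical(keywords)})

remaining = {}  # illustrative: categories left in each group after cleanup
for key, group in df.groupby("x", observed=True):
    group = group.copy()  # explicit copy, no longer tied to df
    group["x"] = group["x"].cat.remove_unused_categories()
    remaining[key] = list(group["x"].cat.categories)
    # run the model on `group` here

print(len(remaining))
```

Each group should end up with a single category, its own key.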
