A Faster Way of Removing Unused Categories in Pandas?
Problem description
I'm running some models in Python on subsets of my data, split by category.
To reduce memory usage and simplify preprocessing, all the categorical variables are stored with the category dtype.
For each level of the categorical variable in my 'group by' column, I run a regression, and for that I need to reset all my categorical variables so they contain only the categories present in that subset.
I am currently doing this using .cat.remove_unused_categories(), which is taking nearly 50% of my total runtime. At the moment, the worst offender is my grouping column; the others are not taking as much time (I guess because there are not as many levels to drop).
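For readers unfamiliar with the method, here is a minimal sketch of what .cat.remove_unused_categories() does to a subset. The series and its values are made up for illustration and are not from the question's data.

```python
import pandas as pd

# Illustrative data: three categories, 'a', 'b' and 'c'
s = pd.Series(list("abca"), dtype="category")

# Filtering rows does not shrink the category list
subset = s[s != "c"]
print(list(subset.cat.categories))   # 'c' is still listed as a category

# Dropping the categories that no longer occur in the subset
trimmed = subset.cat.remove_unused_categories()
print(list(trimmed.cat.categories))
```

The call returns a new Series, which is why the question assigns the result back to the column.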
Here is a simplified example:
import itertools
import pandas as pd
#generate some fake data
alphabets = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
keywords = [''.join(i) for i in itertools.product(alphabets, repeat = 2)]
z = pd.DataFrame({'x':keywords})
#convert to category datatype
z.x = z.x.astype('category')
#groupby
z = z.groupby('x')
#loop over groups
for i in z.groups:
    x = z.get_group(i)
    x.x = x.x.cat.remove_unused_categories()
    #run my fancy model here
On my laptop, this takes about 20 seconds. For this small example, we could convert to str and then back to category for a speed-up, but my real data has at least 300 rows per group.
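The str round trip mentioned above might look like the sketch below. The data is a made-up stand-in, and nothing here measures speed: it only shows that the round trip rebuilds the categories from the values actually present.

```python
import pandas as pd

# Illustrative stand-in for one group's column
col = pd.Series(["aa", "ab", "ac", "aa"], dtype="category")
group = col[col == "aa"]

# Casting to str discards the category list; casting back
# rebuilds it from the values that remain in the group
roundtrip = group.astype(str).astype("category")
print(list(roundtrip.cat.categories))
```

Note that this rebuilds the categories from every row of the group, so its cost grows with group size, which is presumably why it is less attractive for larger groups.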
Is it possible to speed up this loop? I have tried using x.x = x.x.cat.set_categories(i), which takes a similar time, and x.x.cat.categories = i, which asks for the same number of categories as I started with.
Recommended answer
Your problem is that you are assigning z.get_group(i) to x. x is now a copy of a portion of z. Your code will work fine with this change:
for i in z.groups:
    x = z.get_group(i).copy() # will no longer be tied to z
    x.x = x.x.cat.remove_unused_categories()
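For reference, here is a self-contained sketch of the same change that can be run as-is. It uses a smaller fake dataset than the question (3 letters instead of 26). The names df, grouped, sub and remaining are illustrative, and passing observed=True to groupby is an assumption added here so that only categories that actually occur form groups.

```python
import itertools
import pandas as pd

# Smaller fake data than in the question: 3 letters -> 9 keywords
letters = ["a", "b", "c"]
keywords = ["".join(p) for p in itertools.product(letters, repeat=2)]
df = pd.DataFrame({"x": keywords * 2})
df["x"] = df["x"].astype("category")

# Assumption: observed=True restricts groups to categories present in the data
grouped = df.groupby("x", observed=True)

remaining = {}
for key in grouped.groups:
    # .copy() gives an independent frame, as in the answer
    sub = grouped.get_group(key).copy()
    sub["x"] = sub["x"].cat.remove_unused_categories()
    remaining[key] = list(sub["x"].cat.categories)
    # the model would be fitted on `sub` here
```

After the loop, each entry of remaining holds the single category left in that group's copy.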