A Faster Way of Removing Unused Categories in Pandas?

Views: 69 | Posted: 2020/5/24 1:08:42 | Tags: python, pandas, categorical-data


Problem description

I'm running some models in Python, with the data subset by category.

To save memory and simplify preprocessing, all the categorical variables are stored with the category dtype.

For each level of the categorical variable in my 'group by' column, I run a regression, for which I need to reset all my categorical variables to only the categories that are present in that subset.

I am currently doing this with .cat.remove_unused_categories(), which takes nearly 50% of my total runtime. At the moment, the worst offender is my grouping column; the other columns take much less time (I guess because they have fewer levels to drop).
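To make the behaviour concrete, here is a minimal sketch on toy data (the frame `df` and its values are made up for illustration). It shows that filtering rows leaves the full category list in place, and that `.cat.remove_unused_categories()` trims it to the levels that actually occur:

```python
import pandas as pd

# Hypothetical toy data: one categorical column with three levels.
df = pd.DataFrame({"x": pd.Categorical(["a", "a", "b", "c"])})

# Filtering rows does not shrink the category list.
subset = df[df["x"] == "a"]
before = list(subset["x"].cat.categories)

# remove_unused_categories() keeps only the levels that actually occur.
trimmed = subset["x"].cat.remove_unused_categories()
after = list(trimmed.cat.categories)

print(before)  # ['a', 'b', 'c']
print(after)   # ['a']
```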

Here is a simplified example:

import itertools
import string

import pandas as pd

# generate some fake data: every two-letter keyword from 'aa' to 'zz'
alphabets = list(string.ascii_lowercase)
keywords = [''.join(i) for i in itertools.product(alphabets, repeat=2)]
z = pd.DataFrame({'x': keywords})

# convert to category datatype
z.x = z.x.astype('category')

# groupby
z = z.groupby('x')

# loop over groups
for i in z.groups:
    # copy so the assignment below does not trigger SettingWithCopyWarning
    x = z.get_group(i).copy()
    x.x = x.x.cat.remove_unused_categories()
    # run my fancy model here

On my laptop, this takes about 20 seconds. For this small example, we could convert to str and then back to category for a speed-up, but my real data has at least 300 rows per group.
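The str round trip mentioned above could look like the sketch below. The toy frame `df` is an assumption for illustration, and the trick only preserves the values cleanly when the categories are strings to begin with. Its cost grows with the number of rows, which is why it is unlikely to help on larger groups:

```python
import pandas as pd

# Hypothetical toy data with string categories.
df = pd.DataFrame({"x": pd.Categorical(["aa", "ab", "ab", "zz"])})
subset = df[df["x"] == "ab"]

# Round trip: category -> str -> category. The second astype() rebuilds
# the categories from the values that are actually present in the subset.
round_trip = subset["x"].astype(str).astype("category")

print(list(round_trip.cat.categories))  # ['ab']
```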

Is it possible to speed up this loop? I have tried x.x = x.x.cat.set_categories(i), which takes a similar time, and x.x.cat.categories = i, which fails because it asks for the same number of categories as I started with.
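One possible direction for the grouping column specifically, offered as a sketch rather than a verified fix: inside a group every row holds the same value, so the trimmed column must have exactly one category and every code must be 0. That column can therefore be built directly with `pd.Categorical.from_codes` instead of being trimmed. The names `df`, `results`, `key` and `group` are illustrative, and the sketch assumes the column is unordered and has no missing values. Whether it is actually faster depends on the pandas version and the data, so it should be timed on the real workload.

```python
import itertools
import string

import numpy as np
import pandas as pd

# Same fake data as in the question: 676 two-letter keywords.
keywords = ["".join(p) for p in itertools.product(string.ascii_lowercase, repeat=2)]
df = pd.DataFrame({"x": pd.Categorical(keywords)})

results = {}
for key, group in df.groupby("x", observed=True):
    group = group.copy()
    # Every row in this group has x == key, so the trimmed column has a
    # single category and all codes are 0. Build it directly.
    group["x"] = pd.Categorical.from_codes(
        np.zeros(len(group), dtype="int8"), categories=[key]
    )
    results[key] = group  # placeholder for "run my fancy model here"
```

This shortcut only covers the grouping column. Other categorical columns can hold several levels within a group, so they would still need `.cat.remove_unused_categories()` or an equivalent.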
