Dealing with sparse categories in Pandas - replace everything not in top categories with "Other"


Question

When cleaning data I often run into the same problem: a few categories are common (say, the top 10 movie genres) and a long tail of others is sparse. The usual practice is to combine the sparse ones into a single category such as "Other".

This is easy when there are only a few sparse categories:

# Merge the sparse bungalow classes into a single category.
# Assign the result back instead of calling inplace=True on df.property_type.
df['property_type'] = df['property_type'].replace(['Terraced bungalow', 'Detached bungalow', 'Semi-detached bungalow'], 'Bungalow')

But suppose I have a movies dataset where most of the movies were produced by, say, 8 big studios, and I would like to combine everything else under an "other" studio. It makes sense to get the top 8 studios first:

# Names of the 8 most frequent studios, most frequent first.
# (Series.iteritems() was removed in pandas 2.0; the index already holds the names.)
top_8_list = df.studio.value_counts().head(8).index.tolist()

top_8_list
['Universal Pictures',
 'Warner Bros.',
 'Paramount Pictures',
 'Twentieth Century Fox Film Corporation',
 'New Line Cinema',
 'Columbia Pictures Corporation',
 'Touchstone Pictures',
 'Columbia Pictures']

and then do something like:

replace studio where studio is not in the top 8 list with "other"

So the question is: does anyone know an elegant solution for this in pandas? It is a very common data cleaning task.

Recommended answer

You could convert the column to the Categorical type, which also reduces memory use:

import pandas as pd
# Values outside top_cats (and any missing values) become NaN, then 'other'.
top_cats = df.studio.value_counts().head(8).index.tolist() + ['other']
df['studio'] = pd.Categorical(df['studio'], categories=top_cats).fillna('other')
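
If a plain string column is preferred over a Categorical, the same idea can be sketched with `Series.isin` and `Series.where`. This is only an illustration, not part of the original answer: the toy DataFrame, the `n_top` variable, and the `studio_grouped` column name are assumptions made up for the example.

```python
import pandas as pd

# Toy data; the studio names and counts are made up for illustration.
df = pd.DataFrame({
    "studio": ["A", "A", "A", "B", "B", "C", "D", None],
})

n_top = 2  # the question uses 8; 2 keeps the toy example readable

# value_counts() sorts by frequency (descending) and skips missing values.
top_studios = df["studio"].value_counts().head(n_top).index

# Keep values that are in the top list or are missing; replace the rest.
keep = df["studio"].isin(top_studios) | df["studio"].isna()
df["studio_grouped"] = df["studio"].where(keep, "other")

print(df["studio_grouped"].tolist())
```

One difference from the Categorical version: here missing values stay missing, whereas `fillna('other')` also folds them into "other". Drop the `isna()` term from `keep` if that behaviour is wanted.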
