Map column values to 'miscellaneous' if value counts is below a threshold - Categorical Column - Pandas Dataframe
Question
I have a pandas dataframe of shape ~ [200K, 40]. The dataframe has a categorical column (one of many) with over 1000 unique values. I can view the count of each unique value in that column by using:
df['column_name'].value_counts()
How do I now group values together in either of these ways:
- map every value whose count is below a threshold (say 100) to "miscellaneous"?
- or group them based on the cumulative percentage of rows?
Recommended answer
You can extract the values you want to mask from the index of value_counts and then map them to "miscellaneous" using replace:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, 10, (2000, 2)), columns=['A', 'B'])
frequencies = df['A'].value_counts()
condition = frequencies<200 # you can define it however you want
mask_obs = frequencies[condition].index
mask_dict = dict.fromkeys(mask_obs, 'miscellaneous')
df['A'] = df['A'].replace(mask_dict) # or you could make a copy not to modify original data
Now, value_counts shows every value that fell below your threshold grouped together as miscellaneous:
df['A'].value_counts()
Out[18]:
miscellaneous 947
3 226
1 221
0 204
7 201
2 201
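The answer above covers the fixed-threshold case only. For the second option in the question, grouping by cumulative percentage of rows, a minimal sketch along the same lines might look like the following. The example data, the 90% cut-off (`coverage`), and the new column name `A_grouped` are all assumptions made for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: a skewed column of string labels
rng = np.random.default_rng(0)
labels = np.array(list("abcdefghij"))
probs = np.array([0.30, 0.25, 0.15, 0.10, 0.08, 0.05, 0.03, 0.02, 0.01, 0.01])
df = pd.DataFrame({"A": rng.choice(labels, size=2000, p=probs)})

coverage = 0.90  # assumed cut-off: keep the most frequent values covering ~90% of rows

# value_counts sorts by descending frequency; normalize=True gives each value's share of rows
shares = df["A"].value_counts(normalize=True)
cumulative = shares.cumsum()

# Keep a value if the cumulative share accumulated *before* it is still below the cut-off,
# so the value that crosses the cut-off is kept as well
keep = cumulative[cumulative.shift(fill_value=0.0) < coverage].index

# Leave kept values unchanged and replace everything else with the catch-all label
df["A_grouped"] = df["A"].where(df["A"].isin(keep), "miscellaneous")
```

Writing to a new column leaves the original data untouched; assigning back to `df["A"]` instead would mirror the in-place style of the answer above.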