Python: Combining Low Frequency Factors/Category Counts
Question
There is a great solution for this in R.
My df.column looks like:
Windows
Windows
Mac
Mac
Mac
Linux
Windows
...
I want to replace low frequency categories with 'Other' in this df.column vector. For example, I need my df.column to look like:
Windows
Windows
Mac
Mac
Mac
Linux -> Other
Windows
...
I would like to rename these rare categories to reduce the number of factors in my regression, which is why I need the original vector. In Python, after running the command to get the frequency table, I get:
pd.value_counts(df.column)
Windows 26083
iOS 19711
Android 13077
Macintosh 5799
Chrome OS 347
Linux 285
Windows Phone 167
(not set) 22
BlackBerry 11
I wonder if there is a method to rename 'Chrome OS', 'Linux' and the other low frequency values into another category (for example, a category 'Other'), and to do so in an efficient way.
Recommended answer
Build a mask from the percentage of rows each category occupies, i.e.:
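import numpy as np
import pandas as pd

series = df['column'].value_counts()  # counts per category, most frequent first
mask = (series / series.sum() * 100).lt(1)  # True for categories below 1% of all rows
# To replace the values in df['column'], use np.where, i.e.:
df['column'] = np.where(df['column'].isin(series[mask].index), 'Other', df['column'])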
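To try the snippet above without the original data, here is a minimal sketch that rebuilds a column from the frequency table in the question. The name counts and the way df is constructed are assumptions made only for this illustration; the 1% cutoff is the one used in the answer.

```python
import numpy as np
import pandas as pd

# Frequency table copied from the question; used only to rebuild sample data.
counts = {
    "Windows": 26083, "iOS": 19711, "Android": 13077, "Macintosh": 5799,
    "Chrome OS": 347, "Linux": 285, "Windows Phone": 167,
    "(not set)": 22, "BlackBerry": 11,
}
# Repeat each label as many times as it was counted.
df = pd.DataFrame({"column": np.repeat(list(counts), list(counts.values()))})

series = df["column"].value_counts()
mask = (series / series.sum() * 100).lt(1)  # True where a category is under 1% of rows

# Replace every rare label in the column with 'Other'.
df["column"] = np.where(df["column"].isin(series[mask].index), "Other", df["column"])

print(df["column"].value_counts())
```

After this runs, the column should contain only the four frequent categories plus 'Other'.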
To collapse the low-frequency rows of the count series into a single 'Other' row holding their sum:
new = series[~mask]
new['Other'] = series[mask].sum()
Windows 26083
iOS 19711
Android 13077
Macintosh 5799
Other 832
Name: 1, dtype: int64
If you want to replace the index labels instead:
series.index = np.where(series.index.isin(series[mask].index),'Other',series.index)
Windows 26083
iOS 19711
Android 13077
Macintosh 5799
Other 347
Other 285
Other 167
Other 22
Other 11
Name: 1, dtype: int64
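The relabelled index above keeps one 'Other' row per rare category. If a single row is wanted, one option is to group by the new labels and sum, which should give the same total as new['Other'] earlier. A minimal sketch, with the counts from the question typed in by hand and the names labels and combined chosen only for this illustration:

```python
import numpy as np
import pandas as pd

# Counts copied from the frequency table in the question.
series = pd.Series({
    "Windows": 26083, "iOS": 19711, "Android": 13077, "Macintosh": 5799,
    "Chrome OS": 347, "Linux": 285, "Windows Phone": 167,
    "(not set)": 22, "BlackBerry": 11,
})
mask = (series / series.sum() * 100).lt(1)  # True for categories under 1%

# One label per row of the count series: 'Other' for rare rows, the original label otherwise.
labels = np.where(mask, "Other", series.index)

# Group rows that share a label and add up their counts;
# sort=False keeps the order in which labels first appear.
combined = series.groupby(labels, sort=False).sum()
print(combined)
```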
Explanation
(series/series.sum() * 100) # This will give you the percentage i.e
Windows 39.820158
iOS 30.092211
Android 19.964276
Macintosh 8.853165
Chrome OS 0.529755
Linux 0.435101
Windows Phone 0.254954
(not set) 0.033587
BlackBerry 0.016793
Name: 1, dtype: float64
.lt(1) is equivalent to the comparison < 1. It returns a boolean mask, which is then used to index the series and assign the new values.
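As a variation on the same idea, value_counts(normalize=True) returns the shares directly, so the division by series.sum() can be skipped. The sketch below wraps this in a helper. The names lump_rare, min_share and other are hypothetical and not part of pandas, and the threshold here is a fraction (0.01) rather than a percentage (1).

```python
import pandas as pd

def lump_rare(col: pd.Series, min_share: float = 0.01, other: str = "Other") -> pd.Series:
    """Return a copy of col with categories below min_share replaced by other."""
    shares = col.value_counts(normalize=True)   # fraction of rows per category
    rare = shares[shares.lt(min_share)].index   # labels under the cutoff
    return col.where(~col.isin(rare), other)    # keep frequent labels, relabel the rest

# Small made-up sample: 'Linux' makes up 0.5% of the rows.
sample = pd.Series(["Windows"] * 120 + ["Mac"] * 79 + ["Linux"] * 1)
result = lump_rare(sample)
print(result.value_counts())
```

Because value_counts drops missing values by default and isin does not match them against the rare labels, NaN entries should pass through this helper unchanged.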