Reduce number of levels for large categorical variables
Question
Are there any ready-to-use libraries or packages for Python or R that reduce the number of levels of large categorical factors?
I want to achieve something similar to R: "Binning" categorical variables (https://stackoverflow.com/questions/28968983/r-binning-categorical-variables), but keep the k most frequent levels and encode everything else as "Other".
Solution
Here is an example in R that uses data.table a bit, but it should be easy to do without data.table as well.
# Load data.table
library(data.table)
# Some data
set.seed(1)
dt <- data.table(type = factor(sample(c("A", "B", "C"), 10e3, replace = TRUE)),
                 weight = rnorm(n = 10e3, mean = 70, sd = 20))
# Decide the minimum frequency a level needs...
min.freq <- 3350
# Levels that don't meet the minimum frequency (using data.table)
fail.min.f <- dt[, .N, type][N < min.freq, type]
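# Relabel all of these levels as "Other" (matching by level name, so the
# result does not depend on the integer codes of fail.min.f)
levels(dt$type)[levels(dt$type) %in% fail.min.f] <- "Other"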
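The answer notes that the same thing should be easy without data.table, and the question asks for the k most frequent levels rather than a minimum frequency. Below is a minimal base R sketch of both variants. The helper names lump_min_freq and lump_top_k, their arguments, and the toy vector x are hypothetical and chosen for illustration only; they are not part of the original answer.

```r
# Hypothetical helper: keep levels with at least `min_freq` observations
# and merge every other level into `other`.
lump_min_freq <- function(x, min_freq, other = "Other") {
  x <- as.factor(x)
  counts <- table(x)                       # frequency of each level
  rare <- names(counts)[counts < min_freq] # levels below the threshold
  # Assigning the same name to several levels merges them into one.
  levels(x)[levels(x) %in% rare] <- other
  x
}

# Hypothetical helper: keep the k most frequent levels and merge the rest
# into `other`. Ties are resolved by the order returned by sort().
lump_top_k <- function(x, k, other = "Other") {
  x <- as.factor(x)
  counts <- sort(table(x), decreasing = TRUE)
  keep <- names(counts)[seq_len(min(k, length(counts)))]
  levels(x)[!(levels(x) %in% keep)] <- other
  x
}

# Toy data (assumed for illustration): A appears 3 times, B twice, C and D once.
x <- factor(c("A", "A", "A", "B", "B", "C", "D"))

by_min <- lump_min_freq(x, min_freq = 2)
by_top <- lump_top_k(x, k = 2)

print(table(by_min))
print(table(by_top))
```

On this toy vector both calls keep A and B and merge C and D into "Other". If a packaged version is preferred, the forcats package provides fct_lump_n() and fct_lump_min() for the same purpose; that is a pointer, not something the answer above uses.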