Reduce number of levels for large categorical variables

Views: 187 | Published: 2017/8/16 20:42:14 | Tags: python, r, encoding, categorical-data, binning

Question

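Are there any ready-to-use libraries or packages for Python or R that reduce the number of levels of a large categorical factor?

I want to achieve something similar to R: "Binning" categorical variables (https://stackoverflow.com/questions/28968983/r-binning-categorical-variables), but encoding into the k most frequent levels plus "Other".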

Answer

Here is an example in R that uses data.table; the same approach should be easy to adapt without it.

# Load data.table
library(data.table)

# Some data
set.seed(1)
dt <- data.table(type = factor(sample(c("A", "B", "C"), 10e3, replace = TRUE)),
                 weight = rnorm(n = 10e3, mean = 70, sd = 20))

# Decide the minimum frequency a level needs...
min.freq <- 3350

# Levels that don't meet the minimum frequency (using data.table)
fail.min.f <- dt[, .N, by = type][N < min.freq, as.character(type)]

# Rename all of these levels to "Other"; levels given the same name are merged
levels(dt$type)[levels(dt$type) %in% fail.min.f] <- "Other"
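
The R code above filters by a minimum frequency, while the question asks for the k most frequent levels plus "Other" and also mentions Python. A minimal standard-library sketch of both variants might look like the following. The function names `lump_top_k` and `lump_min_freq` and the sample data are hypothetical, chosen only for illustration; they do not come from any library.

```python
from collections import Counter


def lump_top_k(values, k, other="Other"):
    """Keep the k most frequent levels; replace every other level with `other`."""
    counts = Counter(values)
    # most_common(k) returns the k (level, count) pairs with the highest counts;
    # levels with equal counts keep their first-seen order.
    keep = {level for level, _ in counts.most_common(k)}
    return [v if v in keep else other for v in values]


def lump_min_freq(values, min_freq, other="Other"):
    """Replace every level seen fewer than min_freq times with `other`."""
    counts = Counter(values)
    return [v if counts[v] >= min_freq else other for v in values]


# Hypothetical sample data: "red" x3, "blue" x2, "green" x1, "teal" x1
colors = ["red", "blue", "red", "green", "blue", "red", "teal"]

top2 = lump_top_k(colors, k=2)
frequent = lump_min_freq(colors, min_freq=2)

print(top2)
print(frequent)
```

Note that with top-k, ties at the cutoff are resolved arbitrarily (here by first appearance), so a frequency threshold can be the more predictable choice when several levels have similar counts.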
