R - Random Forest and more than 53 categories


Problem description

I know that randomForest cannot handle a categorical predictor with more than 53 categories. Unfortunately, I have to analyze data in which one column has 165 levels, and I want to use randomForest for classification.

My problem is that I cannot remove this column, since the predictor is really important and is known to be valuable.

This predictor is a factor with 165 levels.

Are there any tips on how I can handle this? Since we are talking about film genres, I have no idea.

Are there alternative packages for big data? A special workaround? Something like that?

Switching to Python is not an option. We have too many R scripts here.

Thanks a lot and all the best.

The output of str(data) looks like this:

'data.frame':   481696 obs. of  18 variables:
 $ SENDERNR          : int  432 1612 735 721 436 436 1321 721 721 434 ...
 $ SENDER            : Factor w/ 14 levels "ARD Das Erste",..: 6 3 4 9 12 12 10 9 9 7 ...
 $ GEPLANTE_SENDUNG_N: Factor w/ 12563 levels "-- nicht bekannt --",..: 7070 808 5579 9584 4922 4922 12492 1933 9584 4533 ...
 $ U_N_PROGRAMMCODE  : Factor w/ 14 levels "Bühne/Aufführung",..: 9 4 8 4 8 8 12 8 4 2 ...
 $ U_N_PROGRAMMSPARTE: Factor w/ 6 levels "Anderes","Fiction",..: 5 3 2 3 2 2 5 2 3 3 ...
 $ U_N_SENDUNGSFORMAT: Factor w/ 29 levels "Bühne / Aufführung",..: 20 9 19 4 19 19 24 19 4 16 ...
 $ U_N_GENRE         : Factor w/ 163 levels "Action / Abenteuer",..: 119 147 115 4 158 158 163 61 4 84 ...
 $ U_N_PRODUKTIONSART: Factor w/ 5 levels "Eigen-, Co-, Auftragsproduktion, Cofinanzierung",..: 1 1 3 1 3 3 1 3 1 1 ...
 $ U_N_HERKUNFTSLAND : Factor w/ 25 levels "afrikanische Länder",..: 16 16 25 16 15 15 16 25 16 16 ...
 $ GEPLANTE_SENDUNG_V: Factor w/ 12191 levels "-- nicht bekannt --",..: 6932 800 5470 9382 1518 9318 12119 1829 9382 4432 ...
 $ U_V_PROGRAMMCODE  : Factor w/ 13 levels "Bühne/Aufführung",..: 9 4 8 4 8 8 12 8 4 2 ...
 $ U_V_PROGRAMMSPARTE: Factor w/ 6 levels "Anderes","Fiction",..: 5 3 2 3 2 2 5 2 3 3 ...
 $ U_V_SENDUNGSFORMAT: Factor w/ 28 levels "Bühne / Aufführung",..: 20 9 19 4 19 19 24 19 4 16 ...
 $ U_V_GENRE         : Factor w/ 165 levels "Action / Abenteuer",..: 119 148 115 4 160 19 165 61 4 84 ...
 $ U_V_PRODUKTIONSART: Factor w/ 5 levels "Eigen-, Co-, Auftragsproduktion, Cofinanzierung",..: 1 1 3 1 3 3 1 3 1 1 ...
 $ U_V_HERKUNFTSLAND : Factor w/ 25 levels "afrikanische Länder",..: 16 16 25 16 15 9 16 25 16 16 ...
 $ ABGELEHNT         : int  0 0 0 0 0 0 0 0 0 0 ...
 $ AKZEPTIERT        : Factor w/ 2 levels "0","1": 2 1 2 2 2 2 1 2 2 2 ...
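
Going by the str() output above, U_V_GENRE is not the only column over the limit: U_N_GENRE (163 levels), GEPLANTE_SENDUNG_N (12563) and GEPLANTE_SENDUNG_V (12191) exceed 53 levels as well. A quick way to list such columns could look like the sketch below. The data frame `df` and its values are made up for illustration; only the 53-level threshold comes from the question.

```r
set.seed(1)

# Toy data frame standing in for `data` (hypothetical values)
df <- data.frame(
  sender = factor(sample(letters[1:14], 500, replace = TRUE)),
  genre  = factor(sample(sprintf("g%03d", 1:165), 500, replace = TRUE)),
  target = factor(sample(0:1, 500, replace = TRUE))
)

# Count the levels of every factor column
is_fac <- vapply(df, is.factor, logical(1))
n_lev  <- vapply(df[is_fac], nlevels, integer(1))

# Names of the factor columns with more than 53 levels
too_many <- names(n_lev)[n_lev > 53]
print(n_lev)
print(too_many)
```

Running the same three lines on the real `data` would show every predictor that needs one of the workarounds discussed in the answer.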

Recommended answer

Having faced the same issue, here are some tips I can offer.

  1. Switch to another algorithm, for instance gradient boosting from the gbm package, which can handle up to 1024 categorical levels. If your predictor has quite discriminant parameters, you should also consider probabilistic approaches such as naiveBayes.
  2. Transform your predictor into dummy variables, which can be done with model.matrix. You can then run a random forest over this matrix.
  3. Reduce the number of levels in your factor. That may sound like silly advice, but is it really relevant to look at a factor with such fine granularity? Could you aggregate some of the levels into broader groups?
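
For the third tip, one simple and purely frequency-based way to aggregate is to keep the most common levels and merge everything else into a single catch-all level. The sketch below is only an illustration: `lump_levels`, its arguments, and the synthetic `genre` vector are hypothetical, and grouping genres by domain knowledge may well be the better choice for real data.

```r
# Hypothetical helper: keep the `keep` most frequent levels and merge
# all remaining levels into one level named `other`.
lump_levels <- function(f, keep = 52, other = "Other") {
  f <- droplevels(as.factor(f))
  if (nlevels(f) <= keep + 1) return(f)   # already within the limit
  counts <- sort(table(f), decreasing = TRUE)
  top <- names(counts)[seq_len(keep)]
  out <- as.character(f)
  # Missing values stay NA; only rare, non-missing levels are merged
  out[!is.na(out) & !(out %in% top)] <- other
  factor(out, levels = c(top, other))
}

set.seed(42)
# Synthetic, skewed genre column with up to 165 distinct levels
genre <- factor(sample(sprintf("genre_%03d", 1:165), 5000,
                       replace = TRUE, prob = 1 / (1:165)))

lumped <- lump_levels(genre, keep = 52)
nlevels(lumped)  # 52 kept levels plus "Other"
```

With keep = 52 the result has at most 53 levels, which is the limit mentioned in the question.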

Edit to add a model.matrix example

As mentioned, here is an example of how to use model.matrix to transform your columns into dummy variables.

mydf <- data.frame(var1 = factor(c("A", "A", "A", "B", "B", "C")),
                   var2 = factor(c("X", "Y", "X", "Y", "X", "Z")),
                   target = c(1, 1, 1, 2, 2, 2))
# Set contrasts.arg so that every level gets its own indicator column
dummyMat <- model.matrix(target ~ var1 + var2, mydf,
                         contrasts.arg = list(var1 = contrasts(mydf$var1, contrasts = FALSE),
                                              var2 = contrasts(mydf$var2, contrasts = FALSE)))
# Drop the intercept column (column 1) and append the dummies to the original data
mydf2 <- cbind(mydf, dummyMat[, 2:ncol(dummyMat)])
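
To show how the same idea scales to a predictor like the 165-level genre column, here is a minimal sketch on synthetic data. The vectors `genre` and `accepted` are invented stand-ins, not the asker's data, and the model is only fitted if the randomForest package happens to be installed. Passing the indicator matrix through the `x`/`y` interface means the forest only sees numeric 0/1 columns.

```r
set.seed(1)
n <- 2000
lev <- sprintf("genre_%03d", 1:165)

# Synthetic stand-ins for a high-cardinality predictor and a binary target
genre    <- factor(sample(lev, n, replace = TRUE), levels = lev)
accepted <- factor(rbinom(n, 1, 0.5))

# One indicator column per level; "- 1" drops the intercept,
# so all 165 levels are kept
X <- model.matrix(~ genre - 1)
dim(X)  # 2000 rows, 165 columns

# Fit only if the randomForest package is available
if (requireNamespace("randomForest", quietly = TRUE)) {
  fit <- randomForest::randomForest(x = X, y = accepted, ntree = 50)
  print(fit)
}
```

One trade-off to keep in mind: each tree split can now only separate a single genre from all the others, so the forest may need more trees or deeper trees than it would with a native factor split.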
