Categorical variables with large amounts of categories in XGBoost/CatBoost

Views: 194 | Posted: 2021/7/2 20:07:21 | Tags: machine-learning random-forest xgboost categorical-data catboost

Problem description

I have a question regarding random forests. Imagine that I have data on users interacting with items. The number of items is large, around 10 000. The output of the random forest should be the items that the user is likely to interact with (like a recommender system). For any user, I want to use a feature that describes the items that the user has interacted with in the past. However, mapping the categorical product feature to a one-hot encoding seems very memory-inefficient, as a user interacts with no more than a couple of hundred of the items at most, and sometimes as few as 5.

How would you go about constructing a random forest when one of the input features is a categorical variable with ~10 000 possible values and the output is a categorical variable with ~10 000 possible values? Should I use CatBoost with the features as categorical? Or should I use one-hot encoding, and if so, do you think XGBoost or CatBoost does better?
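
To make the memory concern in the question concrete, here is a minimal sketch of storing each user's history as a multi-hot row in compressed sparse row (CSR) form, so that only the items a user actually touched are kept. The user names, item ids, and the helpers `to_csr` and `row_items` are hypothetical and exist only for illustration; this is one possible representation under those assumptions, not a recommendation from the original post.

```python
# Sketch: multi-hot user histories stored in CSR (compressed sparse row) form.
# All data and helper names below are hypothetical, for illustration only.

N_ITEMS = 10_000  # approximate catalogue size mentioned in the question

# Hypothetical interaction histories: user -> item ids seen in the past.
histories = {
    "user_a": [3, 17, 9_998],
    "user_b": [17, 42, 256, 1_024, 4_096],
    "user_c": [0],
}


def to_csr(histories, n_items):
    """Return (users, data, indices, indptr) describing a multi-hot matrix."""
    users, data, indices, indptr = [], [], [], [0]
    for user, items in histories.items():
        cols = sorted(set(items))  # deduplicate and sort the column ids
        if cols and not (0 <= cols[0] and cols[-1] < n_items):
            raise ValueError(f"item id out of range for {user}")
        users.append(user)
        indices.extend(cols)            # column index of every stored 1
        data.extend([1] * len(cols))    # the stored values themselves
        indptr.append(len(indices))     # where the next row starts
    return users, data, indices, indptr


def row_items(row, indices, indptr):
    """Item ids that are set to 1 in the given row."""
    return indices[indptr[row]:indptr[row + 1]]


users, data, indices, indptr = to_csr(histories, N_ITEMS)

dense_cells = len(users) * N_ITEMS  # cells a dense one-hot matrix would hold
stored_cells = len(indices)         # non-zero cells actually stored

print(f"dense cells: {dense_cells}, stored cells: {stored_cells}")
print("items for row 1:", row_items(1, indices, indptr))
```

The `data`, `indices`, and `indptr` lists follow the same layout that `scipy.sparse.csr_matrix` accepts in its `(data, indices, indptr)` constructor form, and XGBoost's `DMatrix` can be built from SciPy sparse input, so a dense 10 000-column matrix would not need to be materialized. Whether this encoding or CatBoost's native categorical handling gives better predictions is a separate question that the sketch does not address.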
