XGBoost Categorical Variables: Dummification vs encoding

Views: 564 Published: 2020/9/30 0:28:03 Tags: python categorical-data xgboost


Question

When using XGBoost, we need to convert categorical variables into numeric ones.

Would there be any difference in performance/evaluation metrics between the methods of:

1. dummifying your categorical variables
2. encoding your categorical variables from e.g. (a, b, c) to (1, 2, 3)

ALSO：

Would there be any reason not to go with method 2, for example by using LabelEncoder?

Recommended answer

xgboost only deals with numeric columns.

Suppose you have a feature [a,b,b,c] which describes a categorical variable (i.e. there is no numeric relationship between its values).

Using LabelEncoder you will simply have this:

array([0, 1, 1, 2])
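
For reference, the output above could be reproduced with something like the following minimal sketch. It assumes scikit-learn is installed; the variable names `feature`, `encoder` and `encoded` are illustrative and not taken from the original answer.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy feature matching the [a, b, b, c] example in the answer.
feature = ["a", "b", "b", "c"]

encoder = LabelEncoder()
# Each distinct string is assigned an integer code (classes are sorted first).
encoded = encoder.fit_transform(feature)

print(encoded)                 # integer code for each row
print(list(encoder.classes_))  # the string that each code stands for
```

The integer codes carry no meaning beyond identifying the category; the ordering comes only from sorting the labels.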

XGBoost will wrongly interpret this feature as having a numeric relationship (as if a < b < c)! LabelEncoder just maps each string ('a', 'b', 'c') to an integer, nothing more.

Proper way

Using OneHotEncoder you will end up with this:

array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])
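
A minimal sketch of how such a matrix might be produced, assuming scikit-learn and NumPy are available. The names `feature`, `encoder` and `one_hot` are illustrative. The encoder returns a sparse matrix by default, so the sketch converts it to a dense array for display.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder expects a 2-D input: one row per sample, one column per feature.
feature = np.array(["a", "b", "b", "c"]).reshape(-1, 1)

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; densify it for printing.
one_hot = encoder.fit_transform(feature).toarray()

print(one_hot)              # one column per category, one row per sample
print(encoder.categories_)  # which category each column corresponds to
```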

This is the proper representation of a categorical variable for xgboost or any other machine learning tool.

Pandas get_dummies is a nice tool for creating dummy variables (and is easier to use, in my opinion).
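
As a rough illustration of the get_dummies route, assuming pandas is installed: the DataFrame `df` and its column names `feature` and `value` are made up for this sketch and do not come from the original answer.

```python
import pandas as pd

# Hypothetical frame: one categorical column and one numeric column.
df = pd.DataFrame({
    "feature": ["a", "b", "b", "c"],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# Replace the categorical column with one 0/1 indicator column per category.
# Numeric columns that are not listed in `columns` are left untouched.
dummies = pd.get_dummies(df, columns=["feature"], dtype=float)

print(dummies)
```

The resulting all-numeric frame can then be passed to xgboost like any other numeric feature matrix.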

Method 2 in the question above will not represent the data properly.
