Can sklearn random forest directly handle categorical features?


Question

Say I have a categorical feature, color, which takes the values

['red', 'blue', 'green', 'orange'],

and I want to use it to predict something in a random forest. If I one-hot encode it (i.e. I change it to four dummy variables), how do I tell sklearn that the four dummy variables are really one variable? Specifically, when sklearn is randomly selecting features to use at different nodes, it should either include the red, blue, green and orange dummies together, or it shouldn't include any of them.

I've heard that there's no way to do this, but I'd imagine there must be a way to deal with categorical variables without arbitrarily coding them as numbers or something like that.
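To make the concern concrete, here is a minimal pure-Python sketch of one-hot encoding the color feature and then drawing a random subset of columns as split candidates. The `size` column, the toy rows, and the helper names are hypothetical, and the sampling only loosely imitates what a forest does at each node; it is not sklearn's actual code. The point is that each dummy column is drawn independently, so nothing keeps the four color columns together.

```python
import random

COLORS = ["red", "blue", "green", "orange"]


def one_hot(value, categories=COLORS):
    """Return one 0/1 indicator per category (a dummy encoding)."""
    return [1 if value == c else 0 for c in categories]


# Hypothetical toy rows: (color, size). "size" is a made-up numeric feature.
rows = [("red", 1.0), ("blue", 2.5), ("green", 0.7), ("orange", 3.1)]

# After encoding, every row has four dummy columns plus the numeric one.
columns = [f"color_{c}" for c in COLORS] + ["size"]
encoded = [one_hot(color) + [size] for color, size in rows]


def sample_split_candidates(n_columns, max_features, rng):
    """Pick column indices uniformly without replacement.

    Assumption: this stands in for per-node feature subsampling; the
    sampler has no notion that some columns came from one variable.
    """
    return sorted(rng.sample(range(n_columns), max_features))


rng = random.Random(0)
picked = sample_split_candidates(len(columns), max_features=2, rng=rng)
print([columns[i] for i in picked])
```

Running this a few times with different seeds shows subsets that contain some color dummies but not others, which is exactly the behavior the question asks how to avoid.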

Recommended answer

No, there isn't. Somebody's working on this and the patch might be merged into mainline some day, but right now there's no support for categorical variables in scikit-learn except dummy (one-hot) encoding.
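For reference, the dummy-encoding route mentioned above could look roughly like the sketch below. It assumes pandas and scikit-learn are installed; the data, the `size` column, and the labels are invented for illustration. The forest ends up seeing five independent columns (four color dummies plus `size`), not a single grouped color variable.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one categorical and one numeric feature.
X = pd.DataFrame(
    {
        "color": ["red", "blue", "green", "orange",
                  "red", "blue", "green", "orange"],
        "size": [1.0, 2.5, 0.7, 3.1, 1.2, 2.4, 0.9, 3.0],
    }
)
y = [0, 1, 0, 1, 0, 1, 0, 1]  # made-up labels

# One-hot encode "color"; pass the numeric column through unchanged.
preprocess = ColumnTransformer(
    [("color", OneHotEncoder(handle_unknown="ignore"), ["color"])],
    remainder="passthrough",
)

model = Pipeline(
    [
        ("preprocess", preprocess),
        ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
    ]
)
model.fit(X, y)

# Number of columns the forest was trained on: 4 dummies + 1 numeric.
print(model.named_steps["forest"].n_features_in_)
```

Keeping the encoder inside the pipeline means the same category-to-column mapping is applied at prediction time, and `handle_unknown="ignore"` encodes an unseen color as all zeros instead of raising an error.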
