scikit-learn creation of dummy variables


Question

In scikit-learn, for which models do I need to break categorical variables into dummy binary fields?

For example, if the column is political-party and the values are democrat, republican, and green, then for many algorithms you have to break this into three columns, where each row can hold only a single 1 and all the rest must be 0.

This avoids enforcing an ordinality that doesn't exist when discretizing [democrat, republican, green] => [0, 1, 2], since democrat and green aren't actually "farther apart" than any other pair.
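For instance, the transformation might look like this (a minimal sketch using pandas.get_dummies; the column and party names follow the example above):

```python
import pandas as pd

df = pd.DataFrame({"political-party": ["democrat", "republican", "green", "democrat"]})

# Each party becomes its own 0/1 column; exactly one 1 per row.
dummies = pd.get_dummies(df["political-party"], prefix="party", dtype=int)
print(dummies)
#    party_democrat  party_green  party_republican
# 0               1            0                 0
# 1               0            0                 1
# 2               0            1                 0
# 3               1            0                 0
```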

For which algorithms in scikit-learn is this transformation into dummy variables necessary? And for the algorithms that don't require it, it can't hurt, right?

Answer

For which algorithms in scikit-learn is this transformation into dummy variables necessary? And for the algorithms that don't require it, it can't hurt, right?

All algorithms in sklearn, with the notable exception of tree-based methods, require one-hot encoding (also known as dummy variables) for nominal categorical variables.
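As an illustration, a one-hot pipeline feeding a non-tree model might look like this (a minimal sketch using OneHotEncoder with LogisticRegression; the toy data and column names are made up for the example):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({
    "political-party": ["democrat", "republican", "green", "democrat"],
    "age": [34, 52, 29, 41],
})
y = [0, 1, 0, 1]

# One-hot encode the nominal column, pass numeric columns through unchanged.
pre = ColumnTransformer(
    [("party", OneHotEncoder(handle_unknown="ignore"), ["political-party"])],
    remainder="passthrough",
)

model = Pipeline([("pre", pre), ("clf", LogisticRegression())])
model.fit(X, y)
```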

Using dummy variables for categorical features with very large cardinalities might hurt tree-based methods, especially randomized tree methods, by introducing a bias in the feature split sampler. Tree-based methods tend to work reasonably well with a basic integer encoding of categorical features.
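By contrast, here is a sketch of the integer-encoding route for a tree-based model (using OrdinalEncoder with RandomForestClassifier; again the toy data is purely illustrative):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OrdinalEncoder

X = pd.DataFrame({"political-party": ["democrat", "republican", "green", "democrat"]})
y = [0, 1, 0, 1]

# OrdinalEncoder maps each category to an arbitrary integer
# (e.g. democrat -> 0, green -> 1, republican -> 2); trees can
# split on these values without a meaningful order being implied.
X_int = OrdinalEncoder().fit_transform(X)
RandomForestClassifier(n_estimators=100, random_state=0).fit(X_int, y)
```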
