scikit-learn creation of dummy variables
Question
In scikit-learn, for which models do I need to break categorical variables into dummy binary fields?
For example, if the column is political-party and the values are democrat, republican, and green, then for many algorithms you have to break this into three columns, where each row can hold a 1 in only one of them and the rest must be 0.
This avoids enforcing an ordinality that doesn't exist: discretizing [democrat, republican, green] => [0, 1, 2] implies that democrat and green are "farther" apart than other pairs, which isn't actually the case.
For which algorithms in scikit-learn is this transformation into dummy variables necessary? And for those algorithms where it isn't, it can't hurt, right?
Answer
> For which algorithms in scikit-learn is this transformation into dummy variables necessary? And for those algorithms where it isn't, it can't hurt, right?
All algorithms in sklearn, with the notable exception of tree-based methods, require one-hot encoding (also known as dummy variables) for nominal categorical variables.
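Within scikit-learn itself, this is done with `sklearn.preprocessing.OneHotEncoder` (the data below is the hypothetical example from the question; `.toarray()` densifies the sparse matrix the encoder returns by default):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One nominal feature, three categories.
X = np.array([["democrat"], ["republican"], ["green"]])

enc = OneHotEncoder()
# fit_transform returns a sparse matrix by default; densify for inspection.
X_hot = enc.fit_transform(X).toarray()

print(enc.categories_)  # learned category order per feature
print(X_hot)            # one binary column per category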
Using dummy variables for categorical features with very large cardinalities can actually hurt tree-based methods, especially randomized tree methods, by introducing a bias in the feature split sampler. Tree-based methods tend to work reasonably well with a basic integer encoding of categorical features.
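A sketch of that integer encoding, using `sklearn.preprocessing.OrdinalEncoder` on the same hypothetical party column (categories are mapped to 0..k-1 in sorted order, which is fine for trees since they only split on thresholds):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

X = np.array([["democrat"], ["republican"], ["green"], ["democrat"]])

enc = OrdinalEncoder()
# Single column of integers: one code per category, no column explosion.
X_int = enc.fit_transform(X)

print(enc.categories_)  # sorted: democrat=0, green=1, republican=2
print(X_int.ravel())
```

The resulting single integer column can be fed directly to a tree ensemble such as `RandomForestClassifier`, avoiding the wide, sparse matrix that one-hot encoding would create for high-cardinality features.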