How to encode a categorical variable in sklearn?

Question

I'm trying to use the car evaluation dataset from the UCI repository, and I wonder whether there is a convenient way to binarize categorical variables in sklearn. One approach would be to use DictVectorizer or LabelBinarizer, but then I get k different features, whereas you should have just k-1 in order to avoid collinearity. I guess I could write my own function and drop one column, but this bookkeeping is tedious. Is there an easy way to perform such transformations and get a sparse matrix as a result?

Recommended answer

DictVectorizer is the recommended way to generate a one-hot encoding of categorical variables; you can use the sparse argument to get a sparse CSR matrix instead of a dense numpy array. I usually don't care about multicollinearity, and I haven't noticed a problem with the approaches that I tend to use (i.e. LinearSVC, SGDClassifier, tree-based methods).
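
As a rough illustration of the point above, here is a minimal sketch of one-hot encoding with DictVectorizer. The three records and their column names (buying, doors, safety) are made up for the example and only loosely modelled on the car evaluation data; they are not taken from the question.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical records: one dict per row, categorical values as strings.
records = [
    {"buying": "high", "doors": "2", "safety": "low"},
    {"buying": "low", "doors": "4", "safety": "high"},
    {"buying": "med", "doors": "2", "safety": "med"},
]

# sparse=True (the default) yields a SciPy sparse matrix rather than a dense array.
vec = DictVectorizer(sparse=True)
X = vec.fit_transform(records)

# One column per (feature, value) pair, named "feature=value".
feature_names = vec.feature_names_
print(feature_names)
print(X.shape)
```

Each string-valued feature with k distinct levels contributes k columns here, which is the k-versus-k-1 issue raised in the question.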

It shouldn't be a problem to patch the DictVectorizer to drop one column per categorical feature: you simply need to remove one term from DictVectorizer.vocabulary at the end of the fit method. (Pull requests are always welcome!)
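
One way to get a similar effect without patching the class is sketched below. It is not the patch described above: instead of editing the vocabulary inside fit, it uses the existing DictVectorizer.restrict method after fitting to keep all but the first level of each categorical feature. The records are hypothetical, and the sketch assumes categorical features appear as "feature=value" names and that feature names themselves contain no "=".

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer

# Hypothetical records, as in a small categorical dataset.
records = [
    {"buying": "high", "doors": "2", "safety": "low"},
    {"buying": "low", "doors": "4", "safety": "high"},
    {"buying": "med", "doors": "2", "safety": "med"},
]

vec = DictVectorizer(sparse=True)
vec.fit(records)

# Build a boolean mask that drops the first column of each categorical feature.
keep = np.ones(len(vec.feature_names_), dtype=bool)
seen = set()
for i, name in enumerate(vec.feature_names_):
    if "=" not in name:
        continue  # numeric feature: leave it alone
    feature = name.split("=", 1)[0]
    if feature not in seen:
        seen.add(feature)
        keep[i] = False  # first level of this feature becomes the baseline

# restrict() rewrites vocabulary_ and feature_names_ in place.
vec.restrict(keep)

# Values outside the restricted vocabulary are silently ignored by transform().
X = vec.transform(records)
print(vec.feature_names_)
print(X.shape)
```

With this choice the dropped level of each feature is encoded as an all-zero block, so each feature with k levels contributes k-1 columns and the result stays sparse.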
