How to encode categorical features in sklearn?

Question

I have a dataset with 41 features (columns 0 to 40), of which 7 are categorical. The categorical features fall into two subsets:


  • a subset of string type (column-features 1, 2, 3)

  • a subset of int type, in binary form with values 0 or 1 (column-features 6, 11, 20, 21)

Furthermore, the string-typed column-features 1, 2 and 3 have cardinality 3, 66 and 11 respectively. I have to encode them in order to use a support vector machine. This is the code I have so far:

import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn import feature_extraction

df = pd.read_csv("train.csv")
datanumpy = df.to_numpy()  # as_matrix() was removed in pandas 1.0
X = datanumpy[:, 0:41]  # columns 0 through 40 (the 41 features)
y = datanumpy[:, 41]  # column 41 (the labels)

I don't know whether it is better to use DictVectorizer() or OneHotEncoder() (for the reasons given above) and, above all, how to apply them in code to the X matrix I have. Or should I simply assign a number to each category in the string-typed subset, since one-hot encoding features with such high cardinality would make my feature space grow substantially?

EDIT: For the int-typed subset, I guess the best choice is to keep the column-features as they are (not pass them to any encoder). The problem remains for the string-typed subset with high cardinality.
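For reference, the OneHotEncoder route mentioned in the question could look roughly like the sketch below. It assumes a scikit-learn version that ships ColumnTransformer (0.20 or later) and works on the DataFrame rather than the raw NumPy matrix. The toy data and the column names (protocol, service, flag_bin, duration, label) are hypothetical stand-ins for the real train.csv.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

# Hypothetical toy data: two string columns, one binary column, one numeric column.
df = pd.DataFrame({
    "protocol": ["tcp", "udp", "icmp", "tcp", "udp", "tcp"],
    "service": ["http", "dns", "echo", "ftp", "dns", "http"],
    "flag_bin": [0, 1, 0, 1, 1, 0],
    "duration": [1.0, 0.5, 0.1, 2.0, 0.4, 1.5],
    "label": [0, 1, 1, 0, 1, 0],
})
X = df.drop(columns="label")
y = df["label"]

# One-hot encode only the string columns; binary and numeric columns pass through unchanged.
string_cols = ["protocol", "service"]
encoder = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), string_cols)],
    remainder="passthrough",
)
X_encoded = encoder.fit_transform(X)

# Chaining the encoder with the SVM keeps encoding and fitting in one estimator.
model = make_pipeline(encoder, SVC())
model.fit(X, y)
predictions = model.predict(X)
```

Here the 3 protocol values and 4 service values become 7 indicator columns, and the 2 remaining columns are appended as they are.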

Recommended answer

This is by far the easiest:

 df = pd.get_dummies(df, drop_first=True)
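A small illustration of what that call does, on a made-up two-column frame (the column names are hypothetical): string-typed columns are expanded into indicator columns, columns that are already numeric are left alone, and drop_first=True drops one indicator per feature to avoid a redundant column.

```python
import pandas as pd

# Hypothetical toy frame: one string-typed feature and one binary int feature.
df = pd.DataFrame({
    "protocol": ["tcp", "udp", "icmp", "tcp"],  # 3 distinct categories
    "is_guest": [0, 1, 0, 0],                   # already binary, not encoded
})

# Only the string column is expanded; its first category (alphabetically) is dropped.
encoded = pd.get_dummies(df, drop_first=True)
print(encoded.columns.tolist())
```

This matches the asker's guess in the EDIT: the binary int columns need no separate handling, because get_dummies only touches the string-typed ones by default.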

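If you run out of memory or it is too slow, reduce the cardinality before calling get_dummies, for example by keeping only the 10 most frequent values of a column col and relabelling everything else as "other":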

top = df[col].isin(df[col].value_counts().index[:10])
df.loc[~top, col] = "other"
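The two lines above can be wrapped in a small helper and applied before get_dummies. This is only a sketch: the function name keep_top_categories, its parameters and the toy service column are made up for illustration.

```python
import pandas as pd

def keep_top_categories(df, col, n=10, other="other"):
    """Return a copy of df where all but the n most frequent values of col become `other`."""
    top_values = df[col].value_counts().index[:n]
    out = df.copy()
    out.loc[~out[col].isin(top_values), col] = other
    return out

# Hypothetical toy column with 4 distinct values; keep the 2 most frequent.
df = pd.DataFrame({"service": ["http", "http", "http", "dns", "dns", "ftp", "smtp"]})
reduced = keep_top_categories(df, "service", n=2)

# The reduced column now has 3 categories, so drop_first=True leaves 2 indicator columns.
encoded = pd.get_dummies(reduced, drop_first=True)
```

For the 66-value feature in the question, n=10 would cap the number of indicator columns it produces at 10 (11 categories including "other", minus the one dropped by drop_first).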
