sklearn mask for OneHotEncoder does not work


Question

Consider the following data:

from sklearn.preprocessing import OneHotEncoder
import numpy as np
dt = 'object, i4, i4'
d = np.array([('aaa', 1, 1), ('bbb', 2, 2)], dtype=dt)  

I want to exclude the text column using the OHE functionality.

Why does the following not work?

ohe = OneHotEncoder(categorical_features=np.array([False,True,True], dtype=bool))       
ohe.fit(d)
ValueError: could not convert string to float: 'bbb'

It says in the documentation:

categorical_features: "all" or array of indices or mask :
  Specify what features are treated as categorical.
   ‘all’ (default): All features are treated as categorical.
   array of indices: Array of categorical feature indices.
   mask: Array of length n_features and with dtype=bool.

I'm using a mask, yet it still tries to convert to float.

Even when using

ohe = OneHotEncoder(categorical_features=np.array([False,True,True], dtype=bool), 
                    dtype=dt)        
ohe.fit(d)

I get the same error.

And also in the case of "array of indices":

ohe = OneHotEncoder(categorical_features=np.array([1, 2]), dtype=dt)        
ohe.fit(d)

Recommended answer

Keep in mind that estimators in Scikit-Learn are designed for numerical input only. From that point of view it makes no sense to leave a text column in this form: the mask only selects which columns get one-hot encoded, and, as the error message suggests, the whole array is still converted to float first. You have to transform the text column into something numerical, or get rid of it.
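As a minimal sketch of the "get rid of it" option, assuming the structured array from the question: the numeric fields can be copied into a plain 2-D integer array before encoding. The dtype below spells out NumPy's default field names (f0, f1, f2) explicitly, and the variable names numeric and encoded are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Same layout as the question's data, with the default field names written out.
dt = np.dtype([('f0', object), ('f1', 'i4'), ('f2', 'i4')])
d = np.array([('aaa', 1, 1), ('bbb', 2, 2)], dtype=dt)

# Drop the text field by stacking only the two integer fields
# into an ordinary (n_samples, n_features) array.
numeric = np.column_stack([d['f1'], d['f2']])

# OneHotEncoder now only ever sees numerical input.
ohe = OneHotEncoder()
encoded = ohe.fit_transform(numeric).toarray()
print(encoded.shape)  # two rows, two categories per column -> (2, 4)
```

Because the text field never reaches the encoder, no mask is needed at all in this variant.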

If your dataset comes from a Pandas DataFrame, take a look at this small wrapper: https://github.com/paulgb/sklearn-pandas. It lets you transform all the required columns in one step (or leave some of the columns in numerical form).

import pandas as pd
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import LabelBinarizer, OneHotEncoder

data = pd.DataFrame({'text':['aaa', 'bbb'], 'number_1':[1, 1], 'number_2':[2, 2]})

#   text  number_1  number_2
# 0  aaa         1         2
# 1  bbb         1         2

# The encoder for 'text' can be any encoder that gives you a numerical
# representation of the text column; LabelBinarizer is used here only as
# one possible choice (an assumption, not part of the original answer).
mapper = DataFrameMapper([
    ('text', LabelBinarizer()),
    (['number_1', 'number_2'], OneHotEncoder())
])
mapper.fit_transform(data)
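For readers on scikit-learn 0.20 or later, the categorical_features argument from the question is deprecated (and removed in later releases), and per-column selection is done with ColumnTransformer instead. The following is only a sketch of that alternative, not the approach of the answer above; it assumes the same DataFrame and simply drops the text column, and the step name 'ohe' is arbitrary.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

data = pd.DataFrame({'text': ['aaa', 'bbb'], 'number_1': [1, 1], 'number_2': [2, 2]})

# One-hot encode only the two numeric columns; every column not listed
# (here: 'text') is dropped because remainder='drop'.
ct = ColumnTransformer(
    [('ohe', OneHotEncoder(), ['number_1', 'number_2'])],
    remainder='drop',
    sparse_threshold=0,  # always return a dense NumPy array
)
encoded = ct.fit_transform(data)
print(encoded.shape)
```

Setting remainder='passthrough' instead would keep the unlisted columns unchanged, which is the closest equivalent to what the mask in the question was meant to do.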
