OneHotEncoder categorical_features deprecated, how to transform specific column


Problem description

I need to convert an independent (feature) column from strings to a numerical representation. I am using OneHotEncoder for the transformation. My dataset has many independent columns, some of which are:

Country     |    Age       
--------------------------
Germany     |    23
Spain       |    25
Germany     |    24
Italy       |    30 

I have to encode the Country column like this:

0     |    1     |     2     |       3
--------------------------------------
1     |    0     |     0     |      23
0     |    1     |     0     |      25
1     |    0     |     0     |      24 
0     |    0     |     1     |      30

I managed to get the desired transformation using OneHotEncoder as follows:

#Encoding the categorical data
from sklearn.preprocessing import LabelEncoder

labelencoder_X = LabelEncoder()
X[:,0] = labelencoder_X.fit_transform(X[:,0])

#we are dummy encoding as the machine learning algorithms will be
#confused with the values like Spain > Germany > France
from sklearn.preprocessing import OneHotEncoder

onehotencoder = OneHotEncoder(categorical_features=[0])
X = onehotencoder.fit_transform(X).toarray()

Now I'm getting a deprecation message telling me to use categories='auto'. If I do so, the transformation is applied to all independent columns (country, age, salary, etc.).

How can I apply the transformation to the 0th column of the dataset only?

Solution

There are actually two warnings:

FutureWarning: The handling of integer data will change in version 0.22. Currently, the categories are determined based on the range [0, max(values)], while in the future they will be determined based on the unique values. If you want the future behaviour and silence this warning, you can specify "categories='auto'". In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.

And the second:

DeprecationWarning: The 'categorical_features' keyword is deprecated in version 0.20 and will be removed in 0.22. You can use the ColumnTransformer instead.

In the future, you should not specify the columns in the OneHotEncoder directly. Passing categories='auto' only makes the encoder infer each column's categories from its unique values; it does not select which columns are encoded, which is why every column gets transformed. The first message also tells you to use OneHotEncoder directly, without applying LabelEncoder first. Finally, the second message tells you to use ColumnTransformer, which works like a Pipeline for column transformations.
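As a small illustration of the first warning's remark that LabelEncoder is no longer needed, the sketch below feeds the country strings straight into OneHotEncoder. The variable names (countries, encoder, encoded, categories) are made up for this example, and it assumes scikit-learn 0.20 or later.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Country column from the question, as a 2D array of strings (one feature).
countries = np.array([["Germany"], ["Spain"], ["Germany"], ["Italy"]])

# No LabelEncoder step: OneHotEncoder accepts string categories directly.
encoder = OneHotEncoder()
encoded = encoder.fit_transform(countries).toarray()  # sparse matrix -> dense array

# Categories are stored in sorted order, which fixes the order of the output columns.
categories = encoder.categories_[0].tolist()
print(categories)
print(encoded)
```

The output columns follow the sorted category order (Germany, Italy, Spain), so the layout may differ from the hand-written table in the question.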

Here is the equivalent code for your case:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# The last item of the tuple ([0]) is the list of columns to transform in this step;
# remainder="passthrough" keeps all other columns unchanged.
ct = ColumnTransformer([("Name_Of_Your_Step", OneHotEncoder(), [0])], remainder="passthrough")
ct.fit_transform(X)
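To make the ColumnTransformer approach concrete, here is a minimal runnable sketch on the four rows from the question. The array X, the step name "country", and the choice of sparse_threshold=0 (to always get a dense array back) are assumptions made for this example, not part of the original answer.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Data from the question: column 0 is Country, column 1 is Age.
X = np.array(
    [["Germany", 23], ["Spain", 25], ["Germany", 24], ["Italy", 30]],
    dtype=object,
)

ct = ColumnTransformer(
    [("country", OneHotEncoder(), [0])],  # one-hot encode column 0 only
    remainder="passthrough",              # leave Age (and any other column) untouched
    sparse_threshold=0,                   # always return a dense array
)

# The encoded columns come first, followed by the passed-through columns.
result = ct.fit_transform(X).astype(float)
print(result)
```

With the categories sorted as Germany, Italy, Spain, each row ends up as three indicator columns followed by the age, for example [1, 0, 0, 23] for the first row.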

See also: ColumnTransformer documentation

For the above example:

Encoding categorical data (basically, changing text such as the country name into numerical data):

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
#Encode Country Column
labelencoder_X = LabelEncoder()
X[:,0] = labelencoder_X.fit_transform(X[:,0])
ct = ColumnTransformer([("Country", OneHotEncoder(), [0])], remainder = 'passthrough')
X = ct.fit_transform(X)
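The snippet above still runs LabelEncoder before the ColumnTransformer. The sketch below, using hypothetical helper names (make_X, encode) and the question's four rows, suggests that this step is optional: under these assumptions the encoded result is the same with or without it, because both encoders order the categories by sorting.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder


def make_X():
    # Fresh copy of the question's data: Country in column 0, Age in column 1.
    return np.array(
        [["Germany", 23], ["Spain", 25], ["Germany", 24], ["Italy", 30]],
        dtype=object,
    )


def encode(X):
    # One-hot encode column 0 and pass the remaining columns through as-is.
    ct = ColumnTransformer(
        [("Country", OneHotEncoder(), [0])],
        remainder="passthrough",
        sparse_threshold=0,  # assumption for this sketch: force dense output
    )
    return ct.fit_transform(X).astype(float)


# Variant 1: integer-encode the countries first, as in the snippet above.
X_labelled = make_X()
X_labelled[:, 0] = LabelEncoder().fit_transform(X_labelled[:, 0])
with_label_encoder = encode(X_labelled)

# Variant 2: hand the raw strings to OneHotEncoder.
without_label_encoder = encode(make_X())

same = np.array_equal(with_label_encoder, without_label_encoder)
print(same)
```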
