LabelEncoder encoding missing values
Question
I am using the label encoder to convert categorical data into numeric values.
How does LabelEncoder handle missing values?
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np
a = pd.DataFrame(['A','B','C',np.nan,'D','A'])
le = LabelEncoder()
le.fit_transform(a)
Output:
array([1, 2, 3, 0, 4, 1])
For the above example, label encoder changed NaN values to a category. How would I know which category represents missing values?
Recommended answer
Don't use LabelEncoder with missing values. I don't know which version of scikit-learn you're using, but in 0.17.1 your code raises TypeError: unorderable types: str() > float().
As you can see in the source, it calls numpy.unique on the data to encode, which raises TypeError if missing values are found. If you want to encode missing values, first convert them to strings:
a[pd.isnull(a)] = 'NaN'
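To answer the original question of which category stands for the missing values, the approach above can be sketched end to end. This is only a minimal illustration, not the one correct way to do it: the placeholder string 'NaN', the variable names, and the use of a 1-D Series instead of a single-column DataFrame are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# A 1-D Series is used here because LabelEncoder expects 1-D input.
s = pd.Series(['A', 'B', 'C', np.nan, 'D', 'A'])

# Replace missing values with a placeholder string.
# 'NaN' is an arbitrary choice; any string not already in the data works.
MISSING = 'NaN'
filled = s.fillna(MISSING)

le = LabelEncoder()
codes = le.fit_transform(filled)

# Look up the integer code assigned to the placeholder.
missing_code = int(le.transform([MISSING])[0])

print(le.classes_)    # sorted labels; the index of each label is its code
print(codes)
print(missing_code)
```

Because the placeholder is an ordinary label after the replacement, its code can be looked up with le.transform or by finding its position in le.classes_.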