LabelEncoder that keeps missing values as 'NaN'
Question
I am trying to use the label encoder in order to convert categorical data into numeric values.
I need a LabelEncoder that keeps my missing values as 'NaN' so that I can use an Imputer afterwards. So I would like to use a mask to restore the missing values from the original data frame after labelling, like this:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
df = pd.DataFrame({'A': ['x', np.nan, 'z'], 'B': [1, 6, 9], 'C': [2, 1, np.nan]})
     A  B    C
0    x  1  2.0
1  NaN  6  1.0
2    z  9  NaN
dfTmp = df
mask = dfTmp.isnull()
       A      B      C
0  False  False  False
1   True  False  False
2  False  False   True
So I get a dataframe with True/False values.
Then I create the encoder:
df = df.astype(str).apply(LabelEncoder().fit_transform)
How can I proceed from here in order to encode these values?
Thanks
Recommended answer
The first question is: do you wish to encode each column separately, or encode them all with one encoding?
The expression df = df.astype(str).apply(LabelEncoder().fit_transform) implies that you encode all the columns separately.
In that case you can do the following:
df = df.apply(lambda series: pd.Series(
    LabelEncoder().fit_transform(series[series.notnull()]),
    index=series[series.notnull()].index
))
print(df)
Out:
     A  B    C
0  0.0  0  1.0
1  NaN  1  0.0
2  1.0  2  NaN
How it works is explained below. But first, I'll mention a couple of drawbacks of this solution.
Drawbacks
First, the columns end up with mixed types: if a column contains a NaN value, the column has type float, because NaN is a float in Python.
df.dtypes
A    float64
B      int64
C    float64
dtype: object
Float codes make little sense for labels. Still, you can later ignore all the NaNs and convert the rest to integers.
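One way to do that conversion without touching the missing entries is pandas' nullable integer dtype, "Int64" (capital I). This is only a sketch: it assumes a pandas version that has nullable dtypes, and the name encoded is made up here to stand for the encoded frame shown above.

```python
import numpy as np
import pandas as pd

# Encoded frame as produced above: NaN forces columns A and C to float64.
encoded = pd.DataFrame({
    'A': [0.0, np.nan, 1.0],
    'B': [0, 1, 2],
    'C': [1.0, 0.0, np.nan],
})

# "Int64" is the nullable integer dtype; NaN is stored as <NA>.
encoded_int = encoded.astype('Int64')

print(encoded_int.dtypes)
print(encoded_int)
```

If a later step expects plain NumPy floats with NaN, keeping the float64 columns may be the simpler choice.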
The second point is: you probably need to keep each LabelEncoder around, because it is often required later, for instance to do an inverse transform. But this solution doesn't keep the encoders; you have no variable that holds them.
A simple, explicit solution is:
encoders = dict()
for col_name in df.columns:
    series = df[col_name]
    label_encoder = LabelEncoder()
    # Encode only the non-null values and keep their original index.
    df[col_name] = pd.Series(
        label_encoder.fit_transform(series[series.notnull()]),
        index=series[series.notnull()].index
    )
    # Keep the fitted encoder so it can be reused later.
    encoders[col_name] = label_encoder
print(df)
Out:
     A  B    C
0  0.0  0  1.0
1  NaN  1  0.0
2  1.0  2  NaN
- more code, but the result is the same
print(encoders)
Out:
{'A': LabelEncoder(), 'B': LabelEncoder(), 'C': LabelEncoder()}
- also, the encoders are now available, and so is the inverse transform (drop the NaNs first!):
encoders['B'].inverse_transform(df['B'])
Out:
array([1, 6, 9])
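Column B has no missing values, so the call above works directly. For a column that does contain NaN, something like the following sketch could be used; the variable names (original, codes, present, decoded) are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

original = pd.Series(['x', np.nan, 'z'], name='A')

# Encode only the non-null values, keeping their index (as in the answer).
encoder = LabelEncoder()
not_null = original[original.notnull()]
codes = pd.Series(encoder.fit_transform(not_null), index=not_null.index)
codes = codes.reindex(original.index)  # NaN where the value was missing

# Inverse transform: drop NaN, cast the codes back to int, restore the index.
present = codes.dropna().astype(int)
decoded = pd.Series(encoder.inverse_transform(present), index=present.index)
decoded = decoded.reindex(original.index)

print(decoded)
```

The reindex at the end puts NaN back in the rows that were missing in the first place.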
There are also other options, such as a registry class that wraps the encoders. Such a class is compatible with the first solution and makes it easier to iterate over the columns.
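What such a registry could look like is sketched below. The class ColumnLabelEncoders and its method names are hypothetical (nothing like it ships with scikit-learn); it only wraps the per-column logic from above and stores each fitted encoder under its column name.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder


class ColumnLabelEncoders:
    """Hypothetical helper: one LabelEncoder per column, NaN left untouched."""

    def __init__(self):
        self.encoders = {}

    def fit_transform(self, df):
        return df.apply(self._fit_transform_column)

    def _fit_transform_column(self, series):
        not_null = series[series.notnull()]
        encoder = LabelEncoder()
        self.encoders[series.name] = encoder
        # Re-attach the index so the codes line up with the original rows.
        return pd.Series(encoder.fit_transform(not_null), index=not_null.index)

    def inverse_transform(self, df):
        return df.apply(self._inverse_transform_column)

    def _inverse_transform_column(self, series):
        # Codes may be floats because of NaN; drop NaN and cast to int first.
        present = series.dropna().astype(int)
        decoded = self.encoders[series.name].inverse_transform(present)
        return pd.Series(decoded, index=present.index)


df = pd.DataFrame({'A': ['x', np.nan, 'z'], 'B': [1, 6, 9], 'C': [2, 1, np.nan]})
registry = ColumnLabelEncoders()
encoded = registry.fit_transform(df)
decoded = registry.inverse_transform(encoded)
print(encoded)
print(decoded)
```

The one-liner with df.apply stays intact here; the only difference is that the fitted encoders end up in registry.encoders instead of being thrown away.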
How it works
df.apply(lambda series: ...) applies a function that returns a pd.Series to each column, so it returns a dataframe with the new values.
The expression, step by step:
pd.Series(
    LabelEncoder().fit_transform(series[series.notnull()]),
    index=series[series.notnull()].index
)
- series[series.notnull()] drops the NaN values, then the rest is fed to fit_transform.
- since the label encoder returns a numpy.array and discards the index, index=series[series.notnull()].index restores it so that the values are aligned correctly when the columns are combined. If you don't restore the index:
print(df)
Out:
     A  B    C
0    x  1  2.0
1  NaN  6  1.0
2    z  9  NaN
df = df.apply(lambda series: pd.Series(
    LabelEncoder().fit_transform(series[series.notnull()]),
))
print(df)
Out:
     A  B    C
0  0.0  0  1.0
1  1.0  1  0.0
2  NaN  2  NaN
- the values shift away from their correct positions, and an IndexError may even occur.
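For comparison, the mask idea from the question can be made to work too. The sketch below assumes that astype(str) turns NaN into the literal string 'nan', which is what the question's own expression relies on; the mask is then used to put NaN back after encoding.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'A': ['x', np.nan, 'z'], 'B': [1, 6, 9], 'C': [2, 1, np.nan]})

# Remember where the missing values are before casting to str.
mask = df.isnull()

# Encode everything as strings; under the assumption above, NaN is
# treated as the ordinary label 'nan' and gets a code of its own.
encoded = df.astype(str).apply(LabelEncoder().fit_transform)

# Put NaN back wherever the original value was missing.
encoded = encoded.mask(mask)

print(encoded)
```

Note the side effect: 'nan' takes up one code in every column that had a missing value, so the remaining codes are not guaranteed to start at 0 (in column A, 'nan' sorts before 'x', so x and z would get 1 and 2). Encoding only the non-null values, as above, avoids that.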
A single encoder for all columns
In that case, stack the dataframe, fit the encoder, then unstack it:
series_stack = df.stack().astype(str)
label_encoder = LabelEncoder()
df = pd.Series(
    label_encoder.fit_transform(series_stack),
    index=series_stack.index
).unstack()
print(df)
Out:
     A    B    C
0  5.0  0.0  2.0
1  NaN  3.0  1.0
2  6.0  4.0  NaN
- since unstacking the encoded series_stack brings the NaNs back for the missing cells, all values in the DataFrame are floats, so you may prefer to convert them.
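A slightly fuller version of the single-encoder approach might look like this. It is a sketch, not the only way to do it: the explicit dropna() is a defensive assumption in case the installed pandas keeps NaN entries when stacking, and the final conversion uses the nullable "Int64" dtype.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'A': ['x', np.nan, 'z'], 'B': [1, 6, 9], 'C': [2, 1, np.nan]})

# Stack into one long Series indexed by (row, column); drop any NaN that
# survives stacking so it is never turned into the string 'nan'.
series_stack = df.stack().dropna().astype(str)

label_encoder = LabelEncoder()
codes = pd.Series(label_encoder.fit_transform(series_stack), index=series_stack.index)

# Unstacking reintroduces NaN for the missing cells (float64 columns);
# the nullable "Int64" dtype turns the codes back into integers.
encoded = codes.unstack().astype('Int64')
print(encoded)

# The single encoder maps every code back to its (string) label.
print(label_encoder.inverse_transform(codes))
```

Keep in mind that the labels are strings after astype(str), so the inverse transform gives back '1' and '2.0', not the original numbers.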
Hope it helps.