One hot encoding train with values not present on test


Question

I have a train and test set stored as DataFrames. I am trying to one-hot encode the nominal features of my dataset, but I have the following issues:

  1. In total there are 3 categorical features, but I don't know which values each feature takes because the dataset is large.
  2. The test set has values that are not present in the train set, so when I one-hot encode, the train set should have the corresponding vectors marked as 0 for the unseen values. But as I mentioned in 1, I don't know all the feature values.
  3. I found that I can use df = pd.get_dummies(df, prefix_sep='_') to do the one-hot encoding. The command works on all categorical features, but I noticed that it moves the new features to the end of the train DataFrame, which I think is a problem because we no longer know which indices belong to which feature. There is also issue 2: the new train and test sets should have the same indices.

Is there an automated way to do this, or perhaps a library?

Edit

Thanks to the answers below, I was able to perform one-hot encoding on many features. But the code below raised the following issues:

  1. I think scikit-learn strips the column headers and produces the result as an array rather than a DataFrame.
  2. Since the headers are stripped away, we have no knowledge of which vector belongs to which feature. Even if I run df_scaled = pd.DataFrame(ct.fit_transform(data2)) to store the results in a DataFrame, the created DataFrame df_scaled has no headers, and the headers have changed after the pre-processing anyway. Perhaps sklearn.preprocessing.OneHotEncoder has a method that keeps track of the new features and their indices?

Recommended answer

Instead of using pd.get_dummies, which has the drawbacks you identified, use sklearn.preprocessing.OneHotEncoder. When fitted, it learns all nominal categories from your train data, and it then encodes your test data according to those categories. With handle_unknown='ignore', a category that appears only in the test data is encoded as all 0's for that feature.
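To make the pd.get_dummies drawback concrete, here is a minimal sketch. The toy DataFrames and their column names are made up for illustration; the point is only that encoding train and test separately yields different column sets when a category is missing from one of them.

```python
import pandas as pd

# Hypothetical toy data: "A2" appears only in train, "B2" only in test.
train = pd.DataFrame({"A": ["A1", "A2"], "B": ["B1", "B1"]})
test = pd.DataFrame({"A": ["A1"], "B": ["B2"]})

# Each call only sees the categories present in the frame it is given.
train_dummies = pd.get_dummies(train, prefix_sep="_")
test_dummies = pd.get_dummies(test, prefix_sep="_")

print(list(train_dummies.columns))  # ['A_A1', 'A_A2', 'B_B1']
print(list(test_dummies.columns))   # ['A_A1', 'B_B2']
```

The two frames no longer share a layout, so a model trained on the first cannot consume the second as-is.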

Example:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

x_train = np.array([["A1", "B1", "C1"], ["A2", "B1", "C2"]])
x_test = np.array([["A1", "B2", "C2"]])  # "B2" is a new category for column B

# handle_unknown='ignore' tells the encoder to encode unseen categories as all 0's
ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit(x_train)
print(ohe.transform(x_train).toarray())
# [[1. 0. 1. 1. 0.]
#  [0. 1. 1. 0. 1.]]

To get a summary of the categories by column in the train set, do:

print(ohe.categories_)
# [array(['A1', 'A2'], dtype='<U2'),
#  array(['B1'], dtype='<U2'),
#  array(['C1', 'C2'], dtype='<U2')]

To map the one-hot encoded columns back to categories, do:

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(ohe.get_feature_names_out())
# ['x0_A1' 'x0_A2' 'x1_B1' 'x2_C1' 'x2_C2']

Finally, this is how the encoder works on new test data:

print(ohe.transform(x_test).toarray())
# [[1. 0. 0. 0. 1.]]  -> 1 for A1, 0 for A2, 0 for B1 (B2 is unseen), 0 for C1, 1 for C2
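For contrast, here is a short sketch of what happens without handle_unknown='ignore'. It reuses the same toy arrays; the variable names strict and raised are only illustrative. The encoder's default is handle_unknown='error', so transforming data with an unseen category raises a ValueError rather than producing zeros.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

x_train = np.array([["A1", "B1", "C1"], ["A2", "B1", "C2"]])
x_test = np.array([["A1", "B2", "C2"]])  # "B2" was never seen during fit

strict = OneHotEncoder()  # handle_unknown defaults to 'error'
strict.fit(x_train)

try:
    strict.transform(x_test)
    raised = False
except ValueError as exc:
    # The default encoder refuses to encode the unseen category.
    raised = True
    print(exc)
```

Whether to fail loudly or encode zeros depends on the use case: 'error' surfaces data drift early, while 'ignore' keeps a prediction pipeline running.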

Edit:

You seem to be worried about losing the labels after doing the encoding. It is actually very easy to get them back: wrap the result in a DataFrame and take the column names from ohe.get_feature_names_out() (this replaces get_feature_names(), which was removed in scikit-learn 1.2):

import pandas as pd

pd.DataFrame(ohe.transform(x_test).toarray(), columns=ohe.get_feature_names_out())
