Getting ValueError: y contains new labels when using scikit-learn's LabelEncoder


Question

I have a series like:

df['ID'] = ['ABC123', 'IDF345', ...]

I'm using scikit-learn's LabelEncoder to convert it to numerical values to be fed into a RandomForestClassifier.

During training, I do the following:

from sklearn.preprocessing import LabelEncoder

le_id = LabelEncoder()
df['ID'] = le_id.fit_transform(df.ID)

Now, for testing/prediction, when I pass in new data, I want to transform its 'ID' column using le_id. That is, if a value was already seen, encode it according to the label encoder above; otherwise assign it a new numerical value.

In the test file, I was doing the following:

new_df['ID'] = le_id.transform(new_df.ID)

But I'm getting the following error: ValueError: y contains new labels

How do I fix this? Thanks!

Update:

So the task I have is to use the data below (for example) as training data and predict the 'High', 'Mod', 'Low' values for new BankNum, ID combinations. The model should learn from the training dataset the characteristics under which a 'High' is given and those under which a 'Low' is given. For example, below a 'High' is given when there are multiple entries with the same BankNum and different IDs.

df =

BankNum   | ID    | Labels
0098-7772 | AB123 | High
0098-7772 | ED245 | High
0098-7772 | ED343 | High
0870-7771 | ED200 | Mod
0870-7771 | ED100 | Mod
0098-2123 | GH564 | Low

And then predict on something like:

BankNum  | ID
00982222 | AB999
00982222 | AB999
00981111 | AB890

I'm doing something like this:

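import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Note: np.float128 is platform-dependent, and this cast assumes BankNum holds purely numeric strings
df['BankNum'] = df.BankNum.astype(np.float128)

le_id = LabelEncoder()
df['ID'] = le_id.fit_transform(df.ID)

X_train, X_test, y_train, y_test = train_test_split(
    df[['BankNum', 'ID']], df.Labels, test_size=0.25, random_state=42)
clf = RandomForestClassifier(random_state=42, n_estimators=140)
clf.fit(X_train, y_train)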

Recommended answer

I think the error message is very clear: your test data set contains ID labels that were not included in your training data set. For these items, the LabelEncoder cannot find a suitable numeric value to represent them. There are a few ways to solve this problem. You can either try to balance your data set, so that you are sure each label present in your test data is also present in your training data. Otherwise, you can try to follow one of the ideas presented here.
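To see which IDs trigger the error, it can help to compare the test IDs with the encoder's fitted `classes_` attribute before calling `transform`. The following is a minimal sketch, assuming the example IDs from the question; the variable names `train_ids`, `test_ids`, `unseen`, and `raised` are illustrative only.

```python
from sklearn.preprocessing import LabelEncoder

# Example IDs taken from the question's training and prediction tables
train_ids = ["AB123", "ED245", "ED343", "ED200", "ED100", "GH564"]
test_ids = ["AB999", "AB999", "AB890"]

le_id = LabelEncoder()
le_id.fit(train_ids)

# IDs present in the test data that fit() never saw
unseen = sorted(set(test_ids) - set(le_id.classes_))
print("Unseen IDs:", unseen)

# transform() raises ValueError as soon as it meets an unseen label
try:
    le_id.transform(test_ids)
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)
```

The exact wording of the exception message differs between scikit-learn versions, so the sketch prints it rather than matching on it.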

One possible solution is to search through your whole data set at the beginning, get a list of all unique ID values, train the LabelEncoder on this list, and keep the rest of your code just as it is at the moment.
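The first idea can be sketched as follows. This assumes the prediction data is already available when the encoder is fitted, and it uses small stand-in frames built from the question's example IDs. Note that this only avoids the exception: the classifier still has no training examples for IDs that appear only in the new data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Stand-in frames using the IDs from the question (other columns omitted)
df = pd.DataFrame({"ID": ["AB123", "ED245", "ED343", "ED200", "ED100", "GH564"]})
new_df = pd.DataFrame({"ID": ["AB999", "AB999", "AB890"]})

# Collect every unique ID from both data sets and fit the encoder once
all_ids = sorted(set(df["ID"]) | set(new_df["ID"]))
le_id = LabelEncoder()
le_id.fit(all_ids)

# Both frames can now be transformed with the same encoder
df["ID"] = le_id.transform(df["ID"])
new_df["ID"] = le_id.transform(new_df["ID"])
print(new_df)
```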

Another possible solution is to check that the test data contains only labels that were seen during training. If there is a new label, you have to set it to some fallback value like unknown_id (or something similar). By doing this, you put all new, unknown IDs into one class; for these items the prediction will then fail, but you can use the rest of your code as it is now.
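The fallback idea might look like the sketch below. The placeholder string `unknown_id` comes from the answer; adding it to the fitting data so that it receives its own integer code is an assumption of this sketch, as is the extra known ID `ED200` in the new data, which is included only to show that seen IDs keep their original codes.

```python
from sklearn.preprocessing import LabelEncoder

UNKNOWN = "unknown_id"  # placeholder label for IDs never seen in training

train_ids = ["AB123", "ED245", "ED343", "ED200", "ED100", "GH564"]
# "ED200" is added here for illustration; it is not in the question's test table
new_ids = ["AB999", "AB999", "AB890", "ED200"]

# Fit on the training IDs plus the placeholder so it gets an integer code
le_id = LabelEncoder()
le_id.fit(train_ids + [UNKNOWN])

# Replace every unseen ID with the placeholder before transforming
known = set(le_id.classes_)
new_ids_clean = [i if i in known else UNKNOWN for i in new_ids]
encoded = le_id.transform(new_ids_clean)

unknown_code = le_id.transform([UNKNOWN])[0]
print(list(encoded), "unknown code:", unknown_code)
```

All unseen IDs collapse into a single code, so the classifier cannot distinguish between them; whether that is acceptable depends on how much the ID itself is expected to matter for the prediction.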
