Getting ValueError: y contains new labels when using scikit-learn's LabelEncoder

Problem description

I have a series like:

df['ID'] = ['ABC123', 'IDF345', ...]

I'm using scikit-learn's LabelEncoder to convert it to numerical values to be fed into a RandomForestClassifier.

During training, I do the following:

from sklearn.preprocessing import LabelEncoder

le_id = LabelEncoder()
df['ID'] = le_id.fit_transform(df.ID)

But now, for testing/prediction, when I pass in new data, I want to transform the 'ID' column of that data based on le_id: if a value was already seen, transform it according to the label encoder above; otherwise, assign a new numerical value.

In the test file, I was doing the following:

new_df['ID'] = le_id.transform(new_df.ID)

But I'm getting the following error: ValueError: y contains new labels

How do I fix this? Thanks!

UPDATE:

So the task I have is to use the data below (for example) as training data and predict the 'High', 'Mod', 'Low' values for new BankNum, ID combinations. The model should learn from the training dataset the characteristics under which a 'High' is given and under which a 'Low' is given. For example, below, a 'High' is given when there are multiple entries with the same BankNum and different IDs.

df = 

BankNum   | ID    | Labels

0098-7772 | AB123 | High
0098-7772 | ED245 | High
0098-7772 | ED343 | High
0870-7771 | ED200 | Mod
0870-7771 | ED100 | Mod
0098-2123 | GH564 | Low

And then predict on something like:

BankNum   |  ID | 

00982222  | AB999 | 
00982222  | AB999 |
00981111  | AB890 |

I'm doing something like this:

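import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

df['BankNum'] = df.BankNum.astype(np.float128)

# Encode the string IDs as integers
le_id = LabelEncoder()
df['ID'] = le_id.fit_transform(df.ID)

X_train, X_test, y_train, y_test = train_test_split(df[['BankNum', 'ID']], df.Labels, test_size=0.25, random_state=42)
clf = RandomForestClassifier(random_state=42, n_estimators=140)
clf.fit(X_train, y_train)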

Recommended answer

I think the error message is very clear: your test dataset contains ID labels that were not included in your training dataset. For these items, the LabelEncoder cannot find a suitable numeric value to represent them. There are a few ways to solve this problem. You can try to balance your dataset so that every label present in your test data is also present in your training data. Otherwise, you can try one of the ideas presented here.
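To see which IDs trigger the error, you can compare the test values against the encoder's fitted `classes_` before calling `transform`. This is only a minimal sketch: the ID lists are made-up sample data modelled on the tables in the question, and the exact wording of the exception depends on the scikit-learn version (recent releases say "y contains previously unseen labels").

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample data modelled on the question's tables
train_ids = ["AB123", "ED245", "ED343", "ED200", "ED100", "GH564"]
test_ids = ["AB999", "AB999", "AB890", "ED245"]

le_id = LabelEncoder()
le_id.fit(train_ids)

# IDs in the test data that the encoder never saw during fit
unseen = sorted(set(test_ids) - set(le_id.classes_))
print("Unseen IDs:", unseen)

# transform() refuses any value that is not in classes_
try:
    le_id.transform(test_ids)
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)
```

If `unseen` is empty, `transform` will succeed; otherwise, one of the approaches below is needed.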

One possible solution is to go through your whole dataset at the beginning, collect a list of all unique ID values, fit the LabelEncoder on that list, and keep the rest of your code as it is now.
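A rough sketch of that first approach, assuming both the training and the prediction data are available before the encoder is fitted. The frames `df` and `new_df` here are small made-up stand-ins for the ones in the question.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-ins for the question's training and prediction frames
df = pd.DataFrame({"ID": ["AB123", "ED245", "ED343", "ED200", "ED100", "GH564"]})
new_df = pd.DataFrame({"ID": ["AB999", "AB999", "AB890"]})

# Collect every unique ID from both frames and fit the encoder once on all of them
all_ids = sorted(set(df["ID"]) | set(new_df["ID"]))
le_id = LabelEncoder()
le_id.fit(all_ids)

# The same fitted encoder now handles both frames without raising
df["ID"] = le_id.transform(df["ID"])
new_df["ID"] = le_id.transform(new_df["ID"])

print(df["ID"].tolist())
print(new_df["ID"].tolist())
```

Note that this only helps when all IDs are known up front. An ID that first appears after the encoder has been fitted will still raise the same error, and the classifier has no training rows for the IDs that occur only in `new_df`.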

Another possible solution is to check that the test data contains only labels that were seen during training. If there is a new label, you have to set it to some fallback value like unknown_id (or something similar). Doing this, you put all new, unknown IDs into one class; for these items the prediction will then fail, but you can use the rest of your code as it is now.
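The fallback idea could look roughly like this. The helper `encode_with_fallback` and the placeholder string `"unknown_id"` are illustrative names, not part of scikit-learn; the placeholder is added to the values the encoder is fitted on so that it gets its own integer code.

```python
from sklearn.preprocessing import LabelEncoder

UNKNOWN = "unknown_id"  # hypothetical placeholder for IDs not seen in training

# Hypothetical sample data modelled on the question's tables
train_ids = ["AB123", "ED245", "ED343", "ED200", "ED100", "GH564"]
test_ids = ["AB999", "ED245", "AB890"]

le_id = LabelEncoder()
# Fit on the training IDs plus the placeholder so it becomes a known class
le_id.fit(train_ids + [UNKNOWN])


def encode_with_fallback(encoder, values, fallback=UNKNOWN):
    """Replace values the encoder has not seen with `fallback`, then transform."""
    known = set(encoder.classes_)
    cleaned = [v if v in known else fallback for v in values]
    return encoder.transform(cleaned)


encoded = encode_with_fallback(le_id, test_ids)
print(encoded.tolist())
```

All unseen IDs collapse into the single placeholder code, so the model cannot tell them apart. If the column is used as a feature rather than as the target, scikit-learn 0.24 and later also provide `OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)`, which does a similar mapping without a custom helper.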
